4 Results and evaluation

BAR-ILAN UNIVERSITY
Seq_align : A Parsing-Independent
Bilingual Sequence Alignment Algorithm
EHUD S. CONLEY
Submitted in partial fulfilment of the requirements for the Master's Degree
in the Department of Computer Science, Bar-Ilan University
Ramat-Gan, Israel
2002
ACKNOWLEDGEMENTS
In the first place, I would like to express my deep thanks to Dr. Ido Dagan, who advised and guided me towards the completion of this research. I am convinced that his sharp and deep insight has contributed substantially to the quality of both the results and the dissertation itself.
I would like to hereby acknowledge Prof. Jean Véronis of the University of
Marseille, France, who provided me with the full data of the ARCADE project and
has been ever ready to supply any helpful piece of information. Many thanks as well
to Mr. Michel Simard of the University of Montreal, Canada, and to Mr. Éric Gaussier
and Mr. David Hull of the Xerox Research Centre Europe, Grenoble, France, for their
kind and fruitful correspondence with me regarding their works. I would also like to
acknowledge Prof. Achim Stein of the University of Stuttgart, Germany, for his aid in
the issue of lemmatisation.
Finally, I would like to express my appreciation to Prof. Shemuel (Tomi) Klein and Prof. Amihood Amir of the Department of Computer Science at Bar-Ilan University, who contributed their time, experience and patience to help me with certain important aspects of the algorithms' implementation. I am also grateful to my colleagues Zvika Marx and Yuval Krymolowski, who have always been glad to assist in any possible manner.
TABLE OF CONTENTS
ABSTRACT
1 INTRODUCTION
2 BACKGROUND AND RELATED WORK
   2.1 MACHINE-AIDED HUMAN TRANSLATION
   2.2 DICTIONARY INDUCTION AND WORD ALIGNMENT
   2.3 MULTI-WORD UNIT ALIGNMENT
   2.4 THE WORD_ALIGN ALGORITHM
3 THE SEQ_ALIGN ALGORITHM
   3.1 THE EXTENDED MODEL
   3.2 CANDIDATE SELECTION
   3.3 IMPROVEMENT OF THE DICTIONARY
4 RESULTS AND EVALUATION
   4.1 EVALUATION METHODOLOGY
   4.2 THE TEST CORPUS
   4.3 DICTIONARY EVALUATION
   4.4 ALIGNMENT EVALUATION
5 CONCLUSIONS AND FUTURE WORK
REFERENCES
ABSTRACT
This dissertation presents the seq_align algorithm, an extension of the word-level alignment algorithm word_align (Dagan, Church & Gale, 1993). The word_align algorithm tries to find the optimal alignment of single words across two parallel texts, i.e. a pair of texts where one is a translation of the other. The seq_align algorithm is intended to do the same for both single- and multi-word units (MWUs). Unlike other methods for MWU alignment, seq_align does not assume any syntactic knowledge. Rather, it uses statistical considerations as well as primitive lexical criteria to select the candidate word sequences.
The basis of the extension to multi-word sequences is the view of each text as
the set of all word sequences existing within this text, up to a pre-specified maximal
length. The probability that a target-language candidate is a translation of a given source-language candidate is measured through an iterative process similar to that performed by word_align, but this time all candidate sequences (rather than just single words) are judged. An additional set of length probabilities is integrated into the process so as to take into account the length relations between the matching source- and target-language sequences.
In practice, most candidate sequences are invalidated in advance according to
a set of simple rules, using only short lists of the function words in each language.
This cleanup phase enables handling a reasonable number of candidates as well as focusing on the more significant ones.
The output of the iterative process is a probabilistic bilingual dictionary of
multi-word sequences, which can be used to align the parallel text at the sequence
level. However, this dictionary is quite noisy, especially because the statistical model is incapable of choosing between sequences where one is a sub-sequence of the other. A heuristic noise-cleaning algorithm is suggested to overcome this obstacle. The presence of redundant affix function words is handled by an additional elementary algorithm.
The experimental results are based on the data of the ARCADE project (Véronis & Langlais, 2000). The evaluation of these results shows that a syntax-independent algorithm can yield a highly reliable bilingual glossary. Another finding
of the research is that both the word_align and the seq_align algorithms, based on a
directional statistical model, cannot do as well as the Xerox method (Gaussier, Hull &
Aït-Mokhtar, 2000), which is based on a non-directional model. However, it seems
that the principles of the seq_align extension are applicable to a method of this kind as
well.
1 INTRODUCTION
According to the Merriam-Webster online English dictionary, the term wind farm was coined in 1980. Imagine a translator at the end of 1980, trying to translate a new English document, dealing with wind-activated electrical generators, into French. Even the most comprehensive and up-to-date English/French dictionary, printed in the same year, could not provide him with the commonly accepted French parallel of wind farm.
It is not inconceivable to suppose that our imaginary fellow, who was not a
great expert in electricity, had furnished himself with at least one previously-created
English/French pair of parallel documents on a related topic. In that situation, he begins to scan the English version of such a document in order to find an occurrence of the source term, wind farm. When he finally finds what he was seeking, he has to read the parallel section of the French version in order to discover that a wind farm is termed station éolienne in French. Apparently, he could never have guessed that.
Even if the parallel text is available electronically and is already aligned at the paragraph or sentence level, repeating this process for dozens of terms consumes a great deal of time and energy. Evidently, it would be much easier and more efficient if a glossary of the electricity domain were at hand. A collection of parallel text segments aligned at the term level and accessible through the glossary's entries could further help the translator in understanding the context in which each term is used. Within a computerised translation-aid system, by clicking on the glossary entry wind farm and choosing station éolienne, the translator may be shown the following pair of text segments:
The Community's position vis-à-vis the Jandía wind-farm
Position de la Communauté européenne concernant la station éolienne de Jandía

Figure 1: Parallel text segments where the terms wind farm and station éolienne appear as mutual translations. The aligned terms are highlighted in bold. The text is an excerpt from the JOC corpus of the Parliament of the European Union.
In order to be able to supply such resources for many domains while staying
up-to-date, automatic induction tools are needed. Such tools have been developed
since the early 1990s, based on various statistical measuring techniques. However, the
recognition of the boundaries of multi-word terms has always been done using language-specific tools for syntactic parsing, which have knowledge about the typical
structures of meaningful multi-word units (MWUs) in the discussed language.
Though these techniques have yielded quite good results, they are not easily portable to other languages for which syntactic parsing is not available at sufficient quality. Additionally, parsers sometimes do not include the definitions of rare structures, which necessitates the use of a pre-specified list of exceptions.
Consequently, arriving at a point where a syntax-independent algorithm can
provide results of a quality similar to that attained using parsing has remained a valuable objective. The current dissertation presents the seq_align algorithm, a purely statistical method for MWU alignment, based on the prominent word_align algorithm
(Dagan, Church & Gale, 1993). As an algorithm for word-level alignment,
word_align per se cannot supply a MWU glossary. Hence, its output should always be
processed using the output of a monolingual parser, as done in the Termight method
(Dagan & Church, 1994, 1997). The new seq_align algorithm works on the sequence
level, rather than the word level, testing together single- and multi-word sequences in
order to identify their counterparts.
Quite surprisingly, this novel method has managed to produce a rather high-quality glossary, having an average entry coverage rate of 90%, with above 71% of entries supplying a full translation. Furthermore, almost all of the entries can be used to
reach the full translation through a corresponding detailed sequence-level alignment.
The quality of the detailed alignments of both word_align and seq_align has
been evaluated using the evaluation data of the ARCADE project (Véronis & Langlais, 2000). Both algorithms achieved the same average grade of 54% (computed using the F-measure), though there were slight differences in terms of precision and recall. This shared quality is lower than that achieved by some of the systems which originally participated in the ARCADE project, but is still at an acceptable level. The capabilities of the algorithms in term-level alignment have been tested on a
subset of the ARCADE sample, indicating a similar quality.
The dissertation consists of four principal parts: Chapter 2 gives a survey of
background and related work, including a full description of the original word_align
algorithm. Chapter 3 describes the seq_align algorithm itself as well as the methods
used for the selection of candidate sequences and for the improvement of the dictionary’s quality. Chapter 4 presents the evaluation of the results, both for the glossary
and for the detailed alignment. Finally, Chapter 5 discusses the conclusions of the research and points to potential future work.
2 BACKGROUND AND RELATED WORK
This chapter is divided into four sections. Section 2.1 is a general survey of machine-aided human translation. Section 2.2 is an overview of methods for automatic dictionary induction and detailed alignment. Section 2.3 discusses the problem of multi-word
unit alignment and the different approaches towards its solution. Finally, Section 2.4
gives a succinct description of the word_align algorithm (Dagan, Church & Gale,
1993), which is the concrete basis of the currently proposed seq_align algorithm.
2.1 Machine-aided human translation
The task of translating a text from one language to another has been known to be problematic since ancient times. One of the major difficulties is finding the most suitable terminology for the discussed context and using it properly.
Specialised terms, commonly found in technical documents, should be translated not
only suitably but also consistently. In the past, bilingual glossaries (i.e. collections of
specialised terms with their translations) did not exist for most areas, not to mention
organised collections of translation examples. Thus translators had to spend many hours reading and searching through relevant texts in both source and target languages
in order to extract the correct terms and study their proper usage.
In the last few decades, there has been a significant increase in the variety of domains of translated texts as well as in the number of language pairs dealt with. Additionally, the quantities of material to be translated have been growing at an accelerating pace. All these factors motivated serious efforts to develop automatic tools capable of inducing bilingual glossaries and supplying relevant translation examples.
Imitating human learning, these tools try to deduce linguistic knowledge from
bilingual parallel texts, i.e. pairs of texts where one is a translation of the other. For
instance, given an English/French text discussing computer hardware, a glossary-induction tool is expected to find out that Random Access Memory, appearing in the English text, is translated in the French text as mémoire vive. This is so even though random and access by themselves are normally translated as aléatoire and accès, respectively.
Given a pair of word sequences, a bilingual concordancing tool would display
some parallel text segments (for instance, sentences or paragraphs) where these sequences appear and are likely to translate each other. The concordance should also be
able to indicate other correspondences between single- or multi-word units within
each pair of aligned segments.
The task of identifying high-level segment correspondences is referred to as
rough alignment, whereas that of mapping word-level connections is called word
alignment or, more generally, detailed alignment. Figure 2 is an example of a partial
detailed alignment.
During the past decade, significant efforts have been invested in the development of algorithms for automatic induction of bilingual glossaries as well as for both
levels of alignment. Aside from its importance per se, rough alignment is required in order for the glossary induction and the detailed alignment algorithms to function well.
That is because these algorithms are all based on statistical methods which are very
sensitive to slight deviations.¹ Therefore, they need a suitable rough alignment to focus them on relatively short parallel text-zones.
¹ A linear alignment could have been used as rough alignment, meaning that the expected parallel of the ith word of text S would be the jth word of text T such that j = (LT / LS) × i, where LS and LT denote the aggregate lengths of S and T, respectively. In fact, the real translations of words in parallel texts deviate from this diagonal. Alternatively, assuming a large search space to overcome these deviations significantly lowers the accuracy of detailed alignment algorithms.
Je suis convaincu que chacun de mes collègues présents aujourd'hui à
I believe that all of my colleagues presently sitting in
la Chambre des communes aimerait avoir l'occasion de proclamer haut
the House of Commons would like a chance to individually go on record
et fort, devant ses électeurs canadiens, sa fierté d'être
and officially tell their constituents that they are proud to be
Canadien.
Canadians.

Figure 2: An example of a partial English/French detailed alignment. The upper line in each line-system consists of the French text, while the lower line contains the English parallel. The arrows indicate the correspondences between the two texts. The text was excerpted from the Canadian Hansards bilingual corpus, which documents the debates of the Canadian parliament.
In many cases, texts are translated sentence-by-sentence, keeping also the
original sentence order. Some parallel texts of that type even include sentence alignment mark-ups inserted during the translation process. When such a partition is not
indicated explicitly, it can be obtained rather easily by applying one of the many accurate algorithms developed for automatic sentence alignment (for example, Kay &
Röscheisen, 1988, 1993; Church & Gale, 1991; Gale & Church, 1993; Brown, Lai &
Mercer, 1991; Debili & Sammouda, 1992; Simard, Foster & Isabelle, 1992; Haruno &
Yamazaki, 1996, 1997; Johansson, Ebeling & Hofland, 1996). Sentence alignment is
considered a high-quality initial rough alignment for the detailed alignment/glossary
induction algorithms.
In some other cases, however, this level of parallelism does not exist, either
due to the translator’s preferences or because of different natures of the two languages
in terms of grammar and style. In such cases, a rather satisfactory substitute for sentence alignment might be a set of highly reliable pairs of matching word occurrences,
referred to as anchor points, which can be deduced from an unaligned parallel text
(see for example Melamed, 1996, 2000; Fung & McKeown, 1997; Choueka, Conley
& Dagan, 2000).
Indeed, glossary induction and detailed alignment algorithms are strongly correlated. On the one hand, any detailed alignment algorithm is based on a suitable bilingual dictionary (see below). On the other hand, a bilingual glossary can be compiled rather simply using the local connections indicated by a detailed alignment output. Therefore, bilingual detailed alignment and glossary induction are regarded as highly dependent tasks.
2.2 Dictionary induction and word alignment
Naturally, a detailed alignment algorithm must utilise a bilingual dictionary corresponding to the aligned text pair in order to determine the most probable translation of
each textual unit. As bilingual glossaries are rarely available, this information must be
induced by the systems themselves. Different authors have proposed various techniques for the acquisition of these lexicographic data. These methods can be divided into two sorts: (a) single-pass measurements, and (b) iterative processes.
Single-pass measurement means applying certain statistically-based measures to the counts of unit occurrence and co-occurrence within the text pair, obtained by scanning the text pair once. Based on the acquired counts, a relative score is assigned to each pair of units (words or phrases), corresponding to the likelihood that one unit is a valid translation of the other. The measures used by such methods include the well-known Mutual Information measure (Cover & Thomas, 1991), the Dice score (Smadja, 1992) and the T-score (Fung & Church, 1994). The Linköping Word Aligner (LWA) (Ahrenberg, Andersson & Merkel, 2000) is an example of an alignment system based on this type of measurement.
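To make this type of measurement concrete, the sketch below computes Dice scores from occurrence and co-occurrence counts collected in a single pass over a sentence-aligned text pair. It is an illustration only, written in Python: the function and variable names are invented here, and none of the cited systems is claimed to work exactly this way.

from collections import Counter
from itertools import product

def dice_scores(aligned_pairs, min_cooc=2):
    # Dice score for a candidate pair: 2 * cooc(s, t) / (freq(s) + freq(t)),
    # with all counts collected in one pass over the aligned segments.
    freq_s, freq_t, cooc = Counter(), Counter(), Counter()
    for src_segment, tgt_segment in aligned_pairs:
        src_words, tgt_words = set(src_segment.split()), set(tgt_segment.split())
        freq_s.update(src_words)
        freq_t.update(tgt_words)
        cooc.update(product(src_words, tgt_words))
    return {(s, t): 2 * c / (freq_s[s] + freq_t[t])
            for (s, t), c in cooc.items() if c >= min_cooc}

# Toy usage on two aligned segments
pairs = [("random access memory", "mémoire vive"),
         ("random access", "accès aléatoire")]
print(sorted(dice_scores(pairs, min_cooc=1).items(), key=lambda x: -x[1])[:3])

The same counting pass could feed Mutual Information or the T-score instead; only the final scoring formula would change.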
The iterative processes are actually specific bootstrapping algorithms, in which translational equivalence is measured by repeatedly scanning the text pair, each time calculating revised estimates using those attained at the end of the previous
iteration. Each of these algorithms is based on a certain statistical model, which relates to the parallelism existing between the two parts of the bilingual text. Current
algorithms are variants of one of the following basic models:
1. IBM's statistical models, suggested by Brown et al. (1993). The RALI system, which participated in the ARCADE project (see below in Subsection 4.4.1), used lexical data acquired through IBM's Model 1.² A modification of Model 2, named word_align, was suggested by Dagan, Church & Gale (1993), and is the basis of the current research (see a detailed description of word_align in Section 2.4).
2. Hiemstra's model (Hiemstra, 1996), adopted by Melamed (1997b) and by Gaussier, Hull & Aït-Mokhtar (2000).
The fundamental difference between these two model families is in the aspect
of directionality. The IBM models assign unequal roles to the two halves of the bilingual text. One half, referred to as the target text, is assumed to have been generated from the other text, regarded as the source text. Each target-text word, t, is assumed to have been produced by at most one source-text word, s. This does not prevent s from
being the origin of other target words. Hence, IBM models are considered directional
models. As opposed to these assumptions, Hiemstra’s model relates to the two text
halves symmetrically, enabling each word in one text to be aligned with at most one
word in the other text. This non-directional assumption is also referred to as the one-to-one assumption. Cases of one-to-many (and many-to-many) alignments are treated only at the alignment phase (see below in Section 2.3).

² Personal communication.
Once the bilingual dictionary has been induced through one of these methods, a detailed alignment may be generated. All the algorithms mentioned above share the principle of choosing the best translation for each unit according to the acquired dictionary. Some of them first produce all possible alignments accompanied by their relative scores. Then, the best pairs are iteratively chosen while eliminating other pairs
where either the source or target unit appears (e.g. Gaussier, Hull & Aït-Mokhtar,
2000; Ahrenberg, Andersson & Merkel, 2000). Other methods simply connect the
best source unit to each target unit (Brown et al., 1993; Dagan, Church & Gale, 1993),
thus preserving the option of one-to-many alignment.
One way or the other, a pre-defined threshold is always applied to filter out low-scoring, usually erroneous connections (i.e. pairs of aligned units). Additionally, certain positional probabilities are estimated and integrated within the score of each candidate connection in order to also take into account the relative location of the two units (as indicated by the initial rough alignment).
The translation of function words (i.e. words other than nouns, verbs, adjectives and adverbs) is generally inconsistent. Therefore, ignoring these units already at
the dictionary induction phase is a common practice of most methods. Furthermore,
low-frequency phenomena are sometimes more confusing than helpful for the process,
thus each method has its own criteria for their elimination.
As in any text processing, raw versions of parallel texts must be adjusted before the bilingual process is initiated. The first pre-processing step, common to both monolingual and bilingual tasks, is tokenisation, i.e. separating punctuation marks and other symbols from the words to which they are attached. The resulting blank-delimited units are referred to as the text's tokens. In the context of parallel text processing, the term words usually refers to the tokens determined by this handling.
Another important treatment is a certain level of monolingual morphological
analysis through which different inflections of the same word are replaced by a common base form. Some systems apply a full morphological analysis or lemmatisation
(see Choueka, Conley & Dagan, 2000), while others make do with heuristic stemming (e.g. Ahrenberg, Andersson & Merkel, 2000). These preliminary operations unify the various forms of basically identical words observed within the text, thus improving the algorithm's accuracy.
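As a rough illustration of these preprocessing steps, the following Python sketch separates punctuation from words and applies a toy suffix-stripping stemmer. A real system would rely on a proper lemmatiser; the regular expression and the suffix list are purely illustrative assumptions.

import re

def tokenise(text):
    # Separate punctuation marks and other symbols from the words to which
    # they are attached, yielding blank-delimited tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def crude_stem(token):
    # A toy stand-in for lemmatisation: strip a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

raw = "Specialised terms, commonly found in technical documents, should be translated consistently."
print([crude_stem(t.lower()) for t in tokenise(raw)])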
2.3 Multi-word unit alignment
The easiest case of detailed alignment is where each word of the source text is translated into exactly one word in the target text. A slightly harder case is where a few
source words are not translated at all, whereas some target words have no specific
“origin”. In these two cases the one-to-one assumption is valid, hence there is no need
to extend or modify any of the basic methods mentioned above (in Section 2.2) in order to accomplish fairly good results.
Real-life texts, however, do not completely conform to this assumption. Sometimes a sole word is translated into a few words or vice versa, and moreover, some source-language multi-word units (MWUs) are translated non-literally into target-language MWUs. Therefore, correct modelling of the problem is feasible only when the words of the non-literally translated MWUs are fused into an atomic unit.
A further complexity of the problem is that, not infrequently, a MWU is not contiguous within the text. Rather, additional words, such as adjectives, adverbs and pronouns, are often inserted between the basic words of an expression (e.g. centre national de recherche [national research centre]). Idiomatic expressions are sometimes even divided into several sub-units appearing at relatively distant parts of the sentence. The order of the words might also vary due to stylistic or grammatical considerations (e.g. conversion to the passive voice). This variety of potential modifications makes it very complicated to recognise the basic equivalence of different forms of one
expression.
Trying to process all possible combinations of words, even within sentence boundaries, would make the complexity of the problem exponential in the length of the text (or sentence), while probably yielding no significant statistical data. Therefore, most researchers have concluded that it is essential to apply a certain level of syntactic analysis to the text in order to identify the most likely candidate MWUs and recognise their skeletons (principal words). This type of analysis is commonly considered a language-specific monolingual procedure. That is because the structures of expressions in each language are substantially different. Hence, a cross-language generalisation is apparently impossible.
As not every incidental phrase is likely to be an idiom, some methods combine
the linguistic analysis with some statistical measures. The linguistic component in all
methods is based on the output of shallow parsers, implemented through manually-defined regular expressions or local grammars (Jacquemin, 1991; Bourigault, 1992; Smadja, 1993; Daille, 1994). This sort of technique has been applied with some degree of
success to bilingual alignment or glossary induction (e.g. Daille, Gaussier & Langé,
1994; Smadja, McKeown & Hatzivassiloglou, 1996; McEnery, Langé, Oakes & Véronis, 1997).
The MWU mark-ups can be integrated either into the dictionary induction or
the detailed alignment phase. The former option might be interpreted in two manners:
(a) When a candidate MWU is suggested, the whole MWU is considered while ignoring the constituent words, or (b) The candidate MWUs are considered in addition to the single-word tokens within the text.
According to the first definition, if the translation of the fused MWU is compositional (literal), the statistics of the elemental words or sub-units are distorted. The second definition, however, suggests that the decision whether a MWU is translated literally or non-literally will be a consequence of the dictionary induction process. That is, the most probable translation for each token, either a single- or a multi-word unit, will be chosen. Nevertheless, judging additional candidates might lower
the precision of the induction process.
The latter approach was fully adopted by Ahrenberg, Andersson & Merkel
(2000). Melamed (1997a) suggested a specific bootstrapping algorithm for the recognition of non-compositional compounds within a parallel text. This algorithm intends
to solve the problem raised above regarding the former approach. It applies the Mutual Information measure to compare different versions of the texts. Practically, this involves a fairly complete modelling of the text parallelism, which could perhaps have been exploited for the induction of a full dictionary.
As stated above, it is feasible to integrate the preliminarily-detected MWUs
just at the detailed alignment phase, meaning that the dictionary is induced solely for
single words. The MWUs are aligned according to the equivalence of one or more
pairs of elemental tokens, as determined by the single-word alignment algorithm. This
technique is quite suitable for language pairs where most expressions are translated
literally (i.e. where the one-to-one assumption is valid), as in English/French. Nonetheless, it cannot do very well in cases of non-literal translation, let alone when the languages tend to make frequent use of multi-word idioms (as in Chinese and Hebrew), which are rarely translated literally into other languages.
The Termight method, suggested by Dagan & Church (1994, 1997), utilises
the output of the word_align algorithm to find translations of noun-phrase terms. The
candidate source-language terms are proposed by a simple parser and then refined by
a human judge. The suggested translations are simply the concatenation of the words
residing between the leftmost and rightmost translations of the source phrase elements. Intermediate function words, if any, are therefore included automatically within the translation, as well as possible other inserted words, which do not necessarily
originate from the source phrase.
Gaussier, Hull & Aït-Mokhtar (2000) proposed to apply the parsing stage to
both halves of the parallel text. Then, following the one-to-one alignment process, the
probabilities of connections involving MWUs are estimated by multiplying the elements' connection probabilities. This type of connection is tried only if a certain syntactic constraint is satisfied.³ A similar method was used by the RALI system (see footnote 2 above) in the ARCADE project, but without imposing any syntactic constraint.

³ A connection involving a MWU is permitted only if the principal word of one unit, i.e. the head noun (for noun phrases) or the main verb (for verb chunks), is aligned with the principal word of the candidate translation.
By definition, methods based on monolingual parsing are language-specific
and absolutely depend on the existence of a suitable tool for the concerned language.
Building such tools for some languages is a complicated issue due to their morphological richness, writing systems or the like. Moreover, even if a basic tool has already been made available, some idioms have irregular structures that necessitate its further development or even the preparation of a list of exceptions. Therefore, though exploiting monolingual parsing is expected to yield better results, there is still a genuine need for purely statistical, parsing-independent algorithms for detailed alignment.
The current research has taken the modest challenge of extending one existing
method for single-word alignment to handle MWUs without utilising any syntactic
parsing. As a very preliminary and short-term research effort, it had to concentrate on a relatively restricted, feasible sub-task. The alignment of contiguous multi-word sequences seemed an objective of that kind. The number of sequences up to a pre-specified constant length is linear in the text's length, which ensures non-exponential complexity. Additionally, many idioms in many languages are contiguous.
The word_align algorithm (Dagan, Church & Gale, 1993) was taken as the
core single-word alignment method. The basic idea of the presented extension is to
exploit the “centrifugal” property of this method’s dictionary induction phase. In other
words, to take advantage of the process’ nature to gradually augment the probabilities
of the better translations while diminishing the rest. The suggested method is titled
seq_align (following its principal origin, word_align). The following section gives a
concise survey of the word_align algorithm. The seq_align algorithm is introduced
and discussed in detail in the rest of this dissertation.
2.4 The word_align algorithm
As mentioned above, the seq_align algorithm is an extension of the word_align algorithm (Dagan, Church & Gale, 1993). Similarly, the latter algorithm is a modification
of a previous algorithm—IBM's Model 2, proposed a bit earlier by Brown et al.
(1993). This section gives a survey of the word_align version of Model 2 in order to
supply the reader with the appropriate background for the understanding of the
seq_align extension.
The input of word_align is a parallel text, accompanied by a corresponding
rough alignment (either a sentence alignment or a list of alignment anchor points). In
an iterative manner, the algorithm induces a probabilistic bilingual dictionary which
corresponds to the given text, as well as an additional set of estimated values (see below). Then, within another pass over the text pair, it tries to assign an optimal source
word to each target word using the acquired probabilistic estimates.
The following notations shall be used in this dissertation for the description of
the word_align and the seq_align algorithms:
• S, T—the source and target texts, respectively.
• si, tj—the source- and target-text words located at positions⁴ i and j, respectively.
• I—the initial input rough alignment corresponding to the text pair <S,T>.
• I(j)—the source position corresponding to the target position j according to I.
As stated above (in Section 2.2), the word_align algorithm assumes a directional translation model. More specifically, each target-text token tj ∈ T is assumed to have been generated by exactly one source-text token si ∈ S ∪ {s0}, where s0 = NULL. The
NULL token is considered the origin of target tokens that have no source parallel.
Some source tokens, however, may be left unaligned.
An alignment, a, is defined as a set of connections, where a connection <i,j>
denotes that position i in the source text is aligned with position j in the target text.
⁴ Position—the location of a word relative to the beginning of the text.
The statistical model assumed by word_align describes the probability that T is the
translation of S by the equation:
Pr(T | S) = Σa Pr(T, a | S)    (1)

where a ranges over all possible alignments.
The probabilistic bilingual dictionary is generated through the Estimation-Maximisation (EM) technique (Baum, 1972; Dempster, Laird & Rubin, 1977), in accordance with the assumed probabilistic model.⁵ Given S, T and I as inputs, two sets of parameters are estimated:
1. pT(t|s)—Translation Probability—the probability that the target-language word t is the translation of the source-language word s.
2. pO(k)—Offset Probability—the probability that an arbitrary token si, which is the real parallel translation of the token tj, is k words distant from sI(j)—its expected parallel according to the initial alignment (k = i − I(j)).⁶

⁵ For more theoretical background, see (Brown et al., 1993; Dagan, Church & Gale, 1993).
⁶ A positive value of k means that si is located after the position at which it was expected to be found according to the initial rough alignment. A negative value of k indicates that si appears before the expected position. A zero value occurs when si is located exactly where it was expected to be found.
Before the EM process is run, uniform values are given to all parameters.
Then, in an iterative manner, the parameters are re-estimated until they converge to a
local maximum or until a pre-specified number of iterations is reached.
In the first step of each iteration, every target token tj is examined independently as the possible translation of a set of source candidate tokens. These candidates are basically the tokens found within a distance of w words from sI(j) where w
is a pre-specified windowing range. In fact, some filters are applied to both target and
source candidates in order to improve the results.⁷ In addition, the allowed ratio between the frequencies of any connected source and target tokens is normally bounded
as well. Yet, according to the probabilistic model, the probability of a connection
<i,j>, which equals the sum of the probabilities of all alignments that contain this
connection, is represented by
Pr( i, j ) 
W ( i , j )
 W (  i ' , j )
) 2(
i'
where W ( i, j )  pT (t j | si )  pO (i  I ( j ))
and i' ranges over all source positions (in the allowed window).
In the second step of each iteration, all pT and pO parameters are re-estimated
using the Maximum Likelihood Estimate (MLE), given by the following equations:
pT(t | s) = Σ{i,j : tj = t, si = s} Pr(<i,j>) / Σ{i,j : si = s} Pr(<i,j>)    (3)

pO(k) = Σ{i,j : i − I(j) = k} Pr(<i,j>) / Σ{i,j} Pr(<i,j>)    (4)
Both equations estimate the parameter values as relative “probabilistic”
counts. The first estimate is the ratio between the probability sum for all connections
aligning s with t and the probability sum for all connections aligning s with any word.
The second estimate is the ratio between the probability sum for all connections with
offset k and the probability sum for all possible connections.
By the end of each iteration, all pT(t|s) smaller than a certain threshold are set
to 0. The new estimates of pT and pO are used in the next iteration to re-compute the
local connection probabilities, Pr(<i,j>). At the end of the iterative process, pT(t|s)
smaller than a certain final filtering threshold are set to 0, leaving only the most reliable <s,t> pairs. These pairs, together with their final probabilistic estimates, are considered the dictionary for the alignment phase.

⁷ For instance, stop words can be eliminated from the candidate lists of both texts.
The optimal word alignment of the text is found based on the dictionary and
the final pO estimates. Once again, the local W(<i,j>) values are computed (the same
way they were determined at the first step of each EM iteration), but this time, each
target token is simply assigned the most probable source token within its permitted
window. In order to avoid erroneous connections, a threshold T is applied for each j
requiring that max(W(<i,j>)) ≥ T, where i ∈ window of j, thus leaving some target
tokens unaligned. As already established above, a few target tokens may be aligned
with the same source token, but each source token may be connected with at most one
target token.
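The following Python sketch condenses one EM iteration of the model just described, for a sentence-aligned corpus given as pairs of token lists. The parameter names (pT, pO, the window w) follow the text, but the data structures are invented for the example, and the expected source position is approximated by the diagonal of each sentence pair rather than by a general rough alignment I(j).

from collections import defaultdict

def em_iteration(sentence_pairs, p_t, p_o, window=5):
    # One word_align-style EM iteration (sketch): compute the connection
    # probabilities Pr(<i,j>) of equation (2) and re-estimate pT and pO as the
    # relative probabilistic counts of equations (3) and (4).
    count_t = defaultdict(float)   # numerators of equation (3)
    total_s = defaultdict(float)   # denominators of equation (3)
    count_o = defaultdict(float)   # numerators of equation (4)
    total = 0.0                    # denominator of equation (4)
    for src, tgt in sentence_pairs:
        for j, t in enumerate(tgt):
            # Expected source position: the diagonal of the sentence pair,
            # standing in for I(j).  Unknown parameters default to a small
            # constant, i.e. effectively uniform on the first iteration.
            expected = round(j * len(src) / max(len(tgt), 1))
            low, high = max(0, expected - window), min(len(src), expected + window + 1)
            weights = {i: p_t.get((t, src[i]), 1e-6) * p_o.get(i - expected, 1e-6)
                       for i in range(low, high)}
            z = sum(weights.values())
            if z == 0.0:
                continue
            for i, w in weights.items():
                pr = w / z                      # Pr(<i,j>), equation (2)
                count_t[(t, src[i])] += pr
                total_s[src[i]] += pr
                count_o[i - expected] += pr
                total += pr
    new_p_t = {(t, s): c / total_s[s] for (t, s), c in count_t.items()}
    new_p_o = {k: c / total for k, c in count_o.items()}
    return new_p_t, new_p_o

Iterating this function until the estimates stabilise, thresholding the resulting pT values, and then assigning each target token the source token with the maximal W(<i,j>) in its window reproduces the overall flow described above.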
3 THE SEQ_ALIGN ALGORITHM

3.1 The extended model
The seq_align algorithm is an extension of the word_align translation model. According to the extended model, each text is regarded as a set of token sequences, each of
which is represented by its position and length (in tokens). Each of the two sequence
sets includes all sequences of lengths [1, Lmax] that originate at each position throughout the text. Let S' and T' denote the source and target sequence sets, respectively. si,l
and tj,m denote the source and target token sequences of lengths l and m that begin at
positions i and j, correspondingly.
According to the new model, each target sequence tj,m ∈ T' is assumed to have been generated by exactly one source sequence si,l ∈ S' ∪ {s0,0}, where s0,0 = NULL. The
NULL sequence is considered the origin of target sequences that have no source parallels. An alignment, a, is a set of connections, where a connection <i,l,j,m> denotes
that the source sequence of length l beginning at position i is aligned with the target
sequence of length m beginning at position j. The probability of a connection
<i,l,j,m> is represented by the following formula:
Pr( i, l , j , m ) 
W ( i , l , j , m 
Lmax
  W ( i ' , l ' , j , m  )
i ' l '1
)5(
where W ( i, l , j , m )  pT (t j ,m | si ,l )  pO (i  I ( j ))  p L (m | l )
and i’ ranges over all source positions (in the allowed window).
pL is an additional set of parameters which estimate the probability that a
source sequence of length l is translated into a target sequence of length m.
The re-estimation of pT, pO and pL is performed in the same fashion as the two
former sets are treated in word_align:
pT(t | s) = Σ{i,l,j,m : si,l = s, tj,m = t} Pr(<i,l,j,m>) / Σ{i,l,j',m' ∈ [1,Lmax] : si,l = s} Pr(<i,l,j',m'>)    (6)

where j' ranges over all target positions.

pO(k) = Σ{i,l,j,m : i − I(j) = k} Pr(<i,l,j,m>) / Σ{i,l,j,m} Pr(<i,l,j,m>)    (7)

pL(m | l) = Σ{i,j} Pr(<i,l,j,m>) / Σ{i,j,m' ∈ [1,Lmax]} Pr(<i,l,j,m'>)    (8)
Resembling word_align, pT(t | s) is the ratio between the probability sum for
all connections aligning s with t and the probability sum for all connections aligning s
with any target sequence. pO(k) is the ratio between the probability sum for all connections with offset k and the probability sum for all possible connections. The additional length probability estimate, pL(m | l), is the ratio between the probability sum
for all connections aligning source sequences of length l with target sequences of
length m and the probability sum for all connections aligning source sequences of
length l with target sequences of any length.
The integration of the length probability estimate into the EM process is based
on the common assumption that the mutual nature of any two languages involves a
certain level of consistency as to the lengths of corresponding sequences. The data
presented in Table 1 verify this assumption. Notice that long English sequences tend
to be translated into relatively short French sequences. The linguistic explanation for
this phenomenon is that prefix and suffix English units are often translated into infix
French units thus splitting the basic expression into two short sequences (as demonstrated in Section 4.3).
The detailed alignment is generated using a technique similar to that of
word_align, but adapted to handle multi-word units:
1. For each tj,m ∈ T', choose si,l such that:

   argmax_{i,l} W(<i,l,j,m>)    (9)

   where W(<i,l,j,m>) = pT(tj,m | si,l) · pO(i − I(j))

2. Eliminate all connections <i,l,j,m> for which W(<i,l,j,m>) < Th, where Th is a pre-specified threshold.

3. For each connection <i,l,j,m>, produce m copies, associating a number j … (j + m − 1) with each copy.

4. Sort the expanded connection list by the above numbering field.

5. For each target position (denoted by the number), select the connection for which W(<i,l,j,m>) is maximal.
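A minimal Python sketch of steps 2-5 is given below. The tuple representation of connections is an assumption made for the example, and step 1 (choosing the best source sequence for each target sequence) is taken as already done.

def align_sequences(connections, threshold):
    # `connections` holds tuples (i, l, j, m, weight), one per target sequence.
    best = {}                                   # target position -> winning connection
    for i, l, j, m, w in connections:
        if w < threshold:                       # step 2: drop weak connections
            continue
        for pos in range(j, j + m):             # step 3: one copy per covered position
            if pos not in best or w > best[pos][4]:   # steps 4-5: keep the maximal W
                best[pos] = (i, l, j, m, w)
    return best

# Toy usage: two overlapping candidates over target positions 10-12
candidates = [(3, 2, 10, 2, 0.8),   # source sequence at 3 (length 2) for target 10-11
              (7, 1, 11, 2, 0.5)]   # weaker candidate covering target 11-12
print(align_sequences(candidates, threshold=0.3))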
        m=1   m=2   m=3   m=4   m=5   m=6   m=7   m=8   m=9   m=10  m=11  m=12  m=13  m=14  m=15
l=1     0.94  0.05  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
l=2     0.19  0.59  0.15  0.04  0.01  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
l=3     0.08  0.29  0.42  0.14  0.05  0.02  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
l=4     0.05  0.20  0.32  0.26  0.09  0.04  0.02  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00
l=5     0.04  0.16  0.25  0.25  0.18  0.08  0.02  0.01  0.01  0.00  0.00  0.00  0.00  0.00  0.00
l=6     0.03  0.14  0.21  0.22  0.16  0.12  0.05  0.02  0.01  0.00  0.00  0.00  0.00  0.00  0.00
l=7     0.03  0.14  0.21  0.19  0.17  0.10  0.08  0.03  0.02  0.01  0.01  0.00  0.00  0.00  0.00
l=8     0.03  0.11  0.16  0.18  0.17  0.13  0.10  0.05  0.03  0.02  0.01  0.01  0.01  0.00  0.00
l=9     0.03  0.10  0.12  0.14  0.09  0.07  0.06  0.08  0.11  0.15  0.02  0.01  0.01  0.00  0.00
l=10    0.03  0.10  0.13  0.12  0.10  0.05  0.03  0.03  0.05  0.05  0.26  0.02  0.01  0.01  0.00
l=11    0.03  0.10  0.16  0.09  0.12  0.09  0.06  0.11  0.01  0.02  0.02  0.12  0.05  0.01  0.01
l=12    0.03  0.12  0.16  0.14  0.14  0.08  0.12  0.06  0.03  0.02  0.01  0.02  0.05  0.01  0.01
l=13    0.02  0.08  0.14  0.19  0.13  0.11  0.09  0.07  0.04  0.02  0.01  0.02  0.02  0.03  0.02
l=14    0.05  0.13  0.16  0.21  0.11  0.09  0.04  0.02  0.02  0.04  0.03  0.02  0.03  0.02  0.03
l=15    0.03  0.12  0.21  0.18  0.16  0.10  0.06  0.03  0.04  0.01  0.01  0.01  0.01  0.02  0.02

Table 1: Length probabilities computed for the English/French JOC corpus (see Section 4.2). l and m denote the lengths of the English and French sequences, respectively. The elements are pL(m | l).
Unlike the computation of W in the EM process, pL is not included this time. That is because the pT estimates already incorporate the length probabilities (since the lengths of s and t are part of the sequences' identities, unlike the positional offsets, which depend on the local constellation). In cases of identical translation and offset probabilities, pL can be used to break the tie.
This alignment technique was tested against a few other methods, which generate a more coherent output, i.e. an alignment with no contradicting connections. However, for our evaluation data (see Subsection 4.4.1), this simplest version gave the best outcome.
3.2 Candidate selection
Trying to process all possible sequences, even up to a relatively small length, would
result in a very high complexity of time and space. The word_align algorithm, if not
imposing any restrictions on candidate words and connections, would need time and
space of O(N · M), where N and M denote the numbers of tokens in the source and target texts, respectively. Applying seq_align under the same conditions would require O(Lmax² · N · M). This theoretical situation motivates a serious effort to reduce the
number of candidate sequences. In addition, the existence of certain kinds of candidates may introduce some noise into the results. These candidates, too, should be
eliminated.
The model itself already includes two important filters. The first is the w windowing parameter, which bounds the text area where source candidates for a given
target candidate may reside. The frequency ratio limitations also focus the algorithm
on more likely connections.
As function words tend to be translated inconsistently, a common practice in
the alignment field is to dismiss them in advance using a pre-defined language-specific stop-word list. Working with multi-word sequences, we used such lists to
eliminate any sequence which included only stop words. We refer to these units as
stop-word sequences.
Another kind of sequence unlikely to form an atomic unit is one containing punctuation marks. These, too, were ignored.
The other two filters which we applied are related to the frequencies of the sequences within the texts. Sequences which appear only once within the text are problematic from two points of view: (a) they are very likely to be accidental, and (b) their
alignment is unreliable because it cannot be verified using any other source. Hence,
we excluded them from the candidate list of each text, which reduced the sizes of
these lists by more than 90%, due to data sparseness.
The last filtering process refers to sequences which appear in a unique context.
That is, sequences whose preceding or succeeding tokens are the same for all of their
occurrences. If the longer sequences which include these prefix and/or suffix tokens
are valid sequences, it is very difficult to determine which of the two sequences
should be aligned with a candidate from the other text. For example, if the sequence
school always appears within the context of high school, it is hard to know which of
them is aligned with lycée.
Though it is not an absolute truth, in most cases where a sequence appears only within a unique longer sequence, it is because the latter sequence is a meaningful
compound. For that reason, we decided to dismiss all sub-sequences with this characteristic. This was done by comparing the frequency f of each sequence with those of all
its sub-sequences and eliminating all sub-sequences whose frequency was equal to f.
As the data in Section 4.2 show, this filter is a very powerful instrument for reducing
the number of valid candidates.
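The following Python sketch strings the above filters together for a single text, assuming the text is already tokenised. The function names, the punctuation set and the data structures are illustrative choices, not the actual implementation.

from collections import Counter

def contains(longer, shorter):
    # True if `shorter` is a proper contiguous sub-sequence of `longer`.
    n, k = len(longer), len(shorter)
    return k < n and any(longer[i:i + k] == shorter for i in range(n - k + 1))

def candidate_sequences(tokens, stop_words, max_len=5):
    # Enumerate all token sequences up to max_len and apply the filters of
    # this section: drop stop-word-only and punctuation-bearing sequences,
    # drop sequences occurring only once, and drop sub-sequences that always
    # appear inside the same longer candidate (unique context).
    punctuation = set(".,;:!?()[]")
    freq = Counter()
    for start in range(len(tokens)):
        for length in range(1, min(max_len, len(tokens) - start) + 1):
            seq = tuple(tokens[start:start + length])
            if all(tok in stop_words for tok in seq):
                continue                        # stop-word sequence
            if any(tok in punctuation for tok in seq):
                continue                        # contains punctuation
            freq[seq] += 1
    frequent = {seq: f for seq, f in freq.items() if f >= 2}   # drop hapaxes
    return {seq: f for seq, f in frequent.items()
            if not any(contains(longer, seq) and g == f        # unique context
                       for longer, g in frequent.items())}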
3.3 Improvement of the dictionary

3.3.1 Noise-cleaning algorithm
Consider the raw dictionary entry presented in Figure 3. It can easily be recognised that each target sequence is either a sub-sequence of another or one that contains another. Obviously, two sequences where one is contained within the other are rarely correct translations at the same time. A further observation reveals that the correct translation is not assigned a higher probability than those of two wrong suggestions. In some other examples, the correct translation was given even the lowest probability relative to all other options.
Source: local and regional authority   (Freqs = 6)

  Target                           Freqt    pT(t|s)
  autorité local                      60    0.142533
  autorité local et                   10    0.142533
  autorité local et régional           5    0.142533
  local et                            48    0.122089
  local et régional                   10    0.122089
  et régional                         33    0.120195
  le autorité local et régional        4    0.104451
  le autorité local et                 8    0.103579

Figure 3: A sample raw dictionary entry sorted by translation probability
A study of a few dozen examples has led to the conclusion that there are
three factors which are relevant to the selection of the correct translation. A balanced
composition of these factors has resulted in the heuristic algorithm presented below.
Suppose t1 and t2 are two target sequences that satisfy t1 ⊂ t2, i.e. t1 is a sub-sequence of t2. Since unique-context sequences are eliminated from the candidate lists (as explained above in Section 3.2), t1 cannot have the same frequency as t2. Rather, t1 must be more frequent than t2, because any occurrence of t2 contains an occurrence of t1, but not vice versa. Consequently, a greater probability for t1 is not an absolute indication that it should be preferred. On the contrary, a higher probability for t2 is certainly a good
reason to favour it.
As mentioned above (in Section 2.3), the ideal case of bilingual alignment is
where each unit has exactly one parallel in the other half of the text. Though such a
situation is utopian, some units do satisfy this condition or come very close to doing so. Empirical experience indicates that even in less perfect instances, a relatively small difference between the frequencies of the source and target units is a good clue to the correctness of a given translation. Nevertheless, this heuristic has proven to be
a successful selection criterion only where the smaller difference belongs to the longer
sequence (t2). In other cases, the probabilities have seemed to play a very instrumental
role in making the correct decision.
The resulting heuristic is as follows: Given the dictionary entry of an arbitrary source sequence s, the frequency difference of each translation sequence t is defined as the difference between the frequencies of t and s. For each pair of translation
sequences where one is a sub-sequence of the other, if the longer sequence has a
smaller frequency difference, then it should be selected regardless of its probability. If
the differences are equal, then the shorter sequence must be supported by a better
probability in order to be favoured. When the shorter sequence has the smaller difference, then it should be chosen unless the longer sequence provides probabilistic evidence of its superiority.
The cleaning algorithm works in two phases. In the first phase, every pair of
target sequences (which satisfy the containment condition) is tried. One of the candidates is invalidated using the heuristic rules while adding the identity of the other sequence to its list of “better translations”. In the second phase, the probabilities of the
invalidated sequences are equally distributed between those sequences which “overcame” them, provided that those sequences have remained valid. A pseudo-code of
the entire process is given below.
Definitions
s—the source sequence
| t |—number of translations
ti—the ith target sequence in the translation list
pi = pT(ti | s)
f(x)—the frequency of the sequence x
di = | f(ti) – f(s) |
Algorithm
CHOOSEANDINVALIDATE(i, j)
    invalidj = true;                     // invalidating j
    betterj = betterj ∪ {i};             // adding i to the list of “better translations” of j

MAIN
    // First phase: invalidation
    for i = 1 to | t | {
        for j = (i + 1) to | t | {
            if (invalidi && invalidj)
                continue;
            if (ti ⊂ tj)
                (short, long) = (i, j);
            elsif (tj ⊂ ti)
                (short, long) = (j, i);
            else                          // no containment
                continue;
            if (dlong < dshort)
                CHOOSEANDINVALIDATE(long, short);
            elsif (dlong == dshort)
                if (pshort > plong)
                    CHOOSEANDINVALIDATE(short, long);
                else
                    CHOOSEANDINVALIDATE(long, short);
            else                          // dlong > dshort
                if (plong > pshort)
                    CHOOSEANDINVALIDATE(long, short);
                else
                    CHOOSEANDINVALIDATE(short, long);
        } // for j
    } // for i

    // Second phase: probability distribution
    for i = 1 to | t | {
        unless (invalidi)
            continue;
        count = number of valid indexes in betteri;
        if (count == 0)
            continue;
        p = pi / count;
        foreach j (valid indexes in betteri)
            pj += p;
    } // for i
Figure 4 shows the clean dictionary entry achieved after the application of the
algorithm on the raw entry of Figure 3:
Source: local and regional authority   (Freqs = 6)

  Target                           Freqt    pT(t|s)
  autorité local et régional           5    0.896423

Figure 4: The clean dictionary entry
3.3.2 Prefix and suffix stop-words
As a consequence of the statistical model, the translation probabilities are computed
for source-language units, while the estimation of local connection probabilities is
done for target-language units (see Sections 2.4 and 3.1). Similar to the latter operation, the detailed alignment is also done by selecting the best source candidate for
each target unit. Therefore, the dictionary used in that process should consist of target-language entries instead of source-language entries. This resource is obtained by
simply sorting the EM process output by target sequence. The resulting dictionary
might be referred to as inverted.
Yet, consider Figure 5, which presents an example of a raw entry of such an
inverted dictionary. A quick look at the source entries reveals an interesting phenomenon: many suggested translations include prefix and/or suffix sub-sequences of function words or, as we have already labelled them, stop words. For example, aid for the, this aid, be grant to, etc.
The occurrence of necessary prefix and suffix stop-word sequences is certainly
possible. For instance, the expression dans le cadre de may be translated into in the
framework of. However, when the same basic sequence is surrounded by different
prefixes or suffixes, it is a good reason to believe that all of them are definitely redundant. A less but still reliable indication of redundancy is where the basic sequence occurs in the translation list of a dictionary entry only once, but is a sub-sequence of another translation.
To improve the quality of the alignment, we applied the simple method presented in Figure 6 to eliminate this kind of noise. Figure 7 presents the improved dictionary entry for the sequence aide as generated by this algorithm.
Target: aide

  Source                    pT(t | s)
  aid for                   1
  aid for the               1
  aid from                  1
  aid from the              1
  aid have                  1
  aid in                    1
  aid in the                1
  aid measure               1
  assistance                1
  assistance for            1
  assistance in             1
  assistance to             1
  assistance to the         1
  be grant to               1
  community assistance      1
  donor                     1
  for aid                   1
  grant to                  1
  grant to the              1
  subsidy for               1
  such aid                  1
  support from              1
  this aid                  1
  aid                       0.863936
  financial aid             0.709433
  aid to                    0.487991
  financial assistance      0.425051
  christine oddy            0.290366
  of humanitarian           0.226983
  the victim of             0.182403

Figure 5: A sample of a raw entry of an inverted dictionary
Definition
The seed of each translation is the longest sub-sequence beginning and ending in a content word.

Operation
1. If two or more translations share the same seed, replace all of them by a single source entry comprised of the seed itself and the highest probability of all original sequences.
2. If a seed occurs in the translation list of an entry only once, remove the affix stop words, provided that this seed is a sub-sequence of another translation.

Figure 6: Stop-word affix elimination algorithm
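The Python sketch below illustrates the seed-based reduction of Figure 6 in a simplified form: it applies seed extraction and merging unconditionally rather than distinguishing the two cases of the original operation, and all names are hypothetical.

def strip_affix_stop_words(translations, stop_words):
    # `translations` maps a translation (tuple of tokens) to its pT value.
    def seed(seq):
        start, end = 0, len(seq)
        while start < end and seq[start] in stop_words:
            start += 1
        while end > start and seq[end - 1] in stop_words:
            end -= 1
        return seq[start:end]
    merged = {}
    for seq, prob in translations.items():
        core = seed(seq)
        if not core:
            continue                            # pure stop-word translation, drop it
        merged[core] = max(prob, merged.get(core, 0.0))   # keep the highest probability
    return merged

# Toy usage on a fragment of the entry for "aide"
entry = {("aid", "for"): 1.0, ("aid", "for", "the"): 1.0,
         ("aid",): 0.863936, ("this", "aid"): 1.0}
print(strip_affix_stop_words(entry, stop_words={"for", "the", "this"}))
# -> {('aid',): 1.0}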
Target: aide

  Source                    pT(t | s)
  aid                       1
  aid measure               1
  assistance                1
  community assistance      1
  donor                     1
  grant                     1
  subsidy for               1
  support from              1
  financial aid             0.709433
  financial assistance      0.425051
  christine oddy            0.290366
  of humanitarian           0.226983
  the victim of             0.182403

Figure 7: The improved entry
It should be noted that there are certainly many cases where the exact translation is concatenated with redundant content words. These cases, though, cannot be
easily distinguished from cases where the longer translation is the correct one. Applying the noise-cleaning algorithm of Subsection 3.3.1 to the inverted dictionary is not
trivial due to the lack of direct relation between the probabilities.
When we applied the above method to the dictionary used for the evaluation of detailed alignment (see Subsection 4.4.1), the overall F-measure rose by 13% (from 41% to 54%), the recall by 8% and the precision by 15%.
4 RESULTS AND EVALUATION

4.1 Evaluation methodology
The seq_align algorithm yields two types of output: (a) A bilingual sequence dictionary, and (b) a detailed sequence-level alignment. As established above, the former
output is given as an input for the alignment process which generates the latter. However, both outputs might be valuable within the process of machine-aided human
translation, which is the principal motivation of developing alignment algorithms (see
above Section 2.1).
A human translator trying to find a suitable and conventional translation for a
given term would first search an entry for that term within a bilingual glossary. Then,
if the term’s entry is found, the translator would scan the proposed translations to pick
up a proper translation. The quality of such a glossary is measured primarily by its
entry coverage—the extent to which any given term (within the glossary’s domain) is
likely to be found among its entries. Assuming a reasonable entry coverage rate, the
next important question is whether the correct translation appears within the suggested
translation list and at which relative position (first, second or a lower-ranked option).
The higher the position, the shorter the search time needed to retrieve the desired
translation. When a corresponding bilingual concordance accompanies the glossary, even partial hits are usually beneficial, since the translator can find the complete translation by examining the corresponding alignments.
The quality of the detailed alignment is obviously also essential to the efficient work of the translator. This quality is usually measured in terms of precision and recall. Precision is the extent to which the translations suggested by the alignment's connections are correct (ignoring unaligned sequences). Recall is the measure telling how many sequences (out of all valid sequences) in one half of the text were aligned with their correct counterparts in the other half (i.e. the alignment's coverage rate).
The way of computing the precision and recall of an alignment depends on the
desired application. If only full and exact alignments are useful, then a strict definition
of these measures is required, such that scores are given only if the suggested translation consists of all reference words with no redundancies. For the human translator, however, a more relaxed definition is acceptable, one that gives partial credit to
partial successes. More specifically, for a given word sequence in one text, the precision of a translation can be defined as the ratio between the number of correct words
suggested by the alignment and the total number of suggested words. Similarly, the
recall would be the ratio between the number of correct words suggested by the
alignment and the number of correct words stated in the reference connection list.
Equation (10) describes these definitions in formulae. The aggregate precision and recall are therefore the simple average of these values over all reference connections.
$$\text{precision} = \frac{\#\ \text{of correct words}}{\#\ \text{of all suggested words}}, \qquad \text{recall} = \frac{\#\ \text{of correct words}}{\#\ \text{of reference words}} \tag{10}$$
A widely accepted measure for the overall quality of an alignment is the F-measure, which combines the precision and recall as follows:
F 2
precision recall
precision recall
)11(
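As a concrete illustration, the short Python sketch below computes these word-level scores for a single connection; treating both the suggested and the reference translation as bags of words is a simplifying assumption of the sketch, not a claim about the exact evaluation code.

    # Minimal sketch of the relaxed scoring of Equations (10) and (11).
    from collections import Counter

    def word_level_scores(suggested, reference):
        """suggested, reference: lists of words; returns (precision, recall, F)."""
        correct = sum((Counter(suggested) & Counter(reference)).values())
        precision = correct / len(suggested) if suggested else 0.0
        recall = correct / len(reference) if reference else 0.0
        f = 2 * precision * recall / (precision + recall) if correct else 0.0
        return precision, recall, f

    # Two of the three suggested words appear in the two-word reference:
    print(word_level_scores(["high", "performance", "computing"],
                            ["high", "performance"]))      # (0.666..., 1.0, 0.8)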
The evaluation presented in this dissertation was performed using the data of
the ARCADE project, as detailed in Section 4.2 and Subsection 4.4.1. The alignment
systems which participated in this project were judged according to the above definitions of precision, recall and F. These definitions were adopted for seq_align's evaluation in order to enable comparison of the results with those achieved during the ARCADE project. This comparison was done in addition to the obviously required assessment of seq_align versus word_align. Both evaluations are presented and discussed in Section 4.4.
4.2 The test corpus
The test corpus selected for the evaluation of the seq_align algorithm is comprised of
the English and French versions of the European Union’s JOC corpus, a text pair
which had previously been used in the word track of the ARCADE evaluation project
(Véronis & Langlais, 2000). The ARCADE project was the first framework in which
word-level text alignment systems have been comparatively evaluated. This evaluation was intended to create a world-wide benchmark for the state of the art. Using the
ARCADE data for the evaluation of seq_align enabled the comparison of the results
to the global standard.
The JOC corpus is a collection of written questions on various topics, directed
to the Commission of the European Parliament, each of which is followed by the corresponding answer, given by the relevant official. The large variety of related topics
results in an enormous quantity of specialised terms from distinct domains.
The texts are supplied aligned at the paragraph level such that each pair of corresponding questions or answers in the two languages is marked by the same numeric
identifier. Nevertheless, as the translation is rather precise in terms of sentence contents and order, a linear alignment within the paragraph boundaries gives a sufficiently reliable rough alignment. It should be noted that the translation of the text is indirect in the sense that at least parts of the two texts were translated from another text
written in a third language.
The English raw text has about 1,050,000 words, whereas the respective
French text consists of circa 1,162,000 words. The tokenised-lemmatised versions of
these texts contain around 1,171,000 and 1,423,000 tokens, respectively.8
As stated above in Section 3.2, besides sequences occurring only once and those containing punctuation marks, two additional kinds of candidate sequences are filtered out before the dictionary is induced: (a) unique-context sub-sequences, and (b) stop-word sequences. Table 2 presents the frequency distribution of the English candidate sequences of each length separately, as well as the totals for each length and for each frequency range; the aggregate total number of sequences is displayed at the bottom-right corner. Table 3 gives the same kind of data as observed after the first filtering
process. Table 4 reports the counts performed on the final candidate list, as attained
after applying both cleaning processes. The parallel information concerning the
French text is given in Table 5, Table 6 and Table 7.9
The data presented in the tables show that the contextual filtering drops the
number of candidates to about one third, a quantitative effect which the elimination of
stop-word sequences does not have. Nevertheless, since these sequences appear in rather high frequencies, their elimination from the candidate list avoids the creation of a
large number of noisy connections during the dictionary induction process.
8 Both texts were tokenised and lemmatised using the Decision TreeTagger, kindly supplied by the IMS institute, University of Stuttgart, Germany (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html).

9 The maximal sequence length was set to the arbitrary value of 15, which seemed a rather reasonable limitation on the length of phrasal units. Empirical experience indicates that lowering this threshold does not significantly change the quality of results.
lng. \ frq.      2–9     10–99   100–999   1,000–9,999   10,000–99,999   100,000–999,999      Total
1               8169      3568      1059           106              11                 0      12913
2              59409     11455       954            35               2                 0      71855
3              82776      7241       334            11               0                 0      90362
4              63267      3053       138             3               0                 0      66461
5              41879      1370        95             1               0                 0      43345
6              28086       739        79             0               0                 0      28904
7              20273       462        60             0               0                 0      20795
8              15525       294        46             0               0                 0      15865
9              12350       208        32             0               0                 0      12590
10             10127       155        18             0               0                 0      10300
11              8450       107         4             0               0                 0       8561
12              7143        74         0             0               0                 0       7217
13              6093        51         0             0               0                 0       6144
14              5200        36         0             0               0                 0       5236
15              4451        24         0             0               0                 0       4475
Total         373198     28837      2819           156              13                 0     405023

Table 2: Frequency distribution of the English candidate sequences before filtering. Each row details the distribution of sequences of the indicated length.
lng. \ frq.      2–9     10–99   100–999   1,000–9,999   10,000–99,999   100,000–999,999      Total
1               5714      3456      1050           106              11                 0      10337
2              39433     10777       920            34               2                 0      51166
3              44764      6261       279            11               0                 0      51315
4              26836      2316        79             2               0                 0      29233
5              12839       816        24             1               0                 0      13680
6               5627       315        15             0               0                 0       5957
7               2801       151         1             0               0                 0       2953
8               1516        75        10             0               0                 0       1601
9                862        48        12             0               0                 0        922
10               535        30        14             0               0                 0        579
11               362        23         4             0               0                 0        389
12               232        12         0             0               0                 0        244
13               210         8         0             0               0                 0        218
14               162         8         0             0               0                 0        170
15               127         3         0             0               0                 0        130
Total         142020     24299      2408           154              13                 0     168894

Table 3: Frequency distribution of the English candidate sequences after the elimination of unique-context sub-sequences. Each row details the distribution of sequences of the indicated length.
lng. \ frq.      2–9     10–99   100–999   1,000–9,999   10,000–99,999   100,000–999,999      Total
1               5688      3395       973            69               1                 0      10126
2              38255     10061       740            14               1                 0      49071
3              43194      5844       257            11               0                 0      49306
4              26241      2267        79             2               0                 0      28589
5              12764       813        24             1               0                 0      13602
6               5621       315        15             0               0                 0       5951
7               2799       151         1             0               0                 0       2951
8               1516        75        10             0               0                 0       1601
9                862        48        12             0               0                 0        922
10               535        30        14             0               0                 0        579
11               362        23         4             0               0                 0        389
12               232        12         0             0               0                 0        244
13               210         8         0             0               0                 0        218
14               162         8         0             0               0                 0        170
15               127         3         0             0               0                 0        130
Total         138568     23053      2129            97               2                 0     163849

Table 4: Frequency distribution of the English candidate sequences after both unique-context sub-sequences and stop-word sequences have been eliminated. Each row details the distribution of sequences of the indicated length.
lng. \ frq.      2–9     10–99   100–999   1,000–9,999   10,000–99,999   100,000–999,999      Total
1               8198      3661      1100           106              10                 2      13077
2              50384     11483      1323            63               3                 0      63256
3              86274     11380       823            23               0                 0      98500
4              89265      6677       308             7               0                 0      96257
5              69989      3377       139             3               0                 0      73508
6              50950      1739        93             1               0                 0      52783
7              36737       992        80             0               0                 0      37809
8              27312       597        61             0               0                 0      27970
9              21257       423        47             0               0                 0      21727
10             17040       324        33             0               0                 0      17397
11             13971       252        19             0               0                 0      14242
12             11640       198         5             0               0                 0      11843
13              9773       168         1             0               0                 0       9942
14              8263       142         0             0               0                 0       8405
15              7040       123         0             0               0                 0       7163
Total         508093     41536      4032           203              13                 2     553879

Table 5: Frequency distribution of the French candidate sequences before filtering. Each row details the distribution of sequences of the indicated length.
lng. \ frq.      2–9     10–99   100–999   1,000–9,999   10,000–99,999   100,000–999,999      Total
1               5193      3509      1091           106              10                 2       9911
2              28362     10452      1269            63               3                 0      40149
3              42988      9822       759            23               0                 0      53592
4              36329      5106       241             6               0                 0      41682
5              22637      2235        68             2               0                 0      24942
6              12625       963        17             1               0                 0      13606
7               6867       438        11             0               0                 0       7316
8               3530       170         5             0               0                 0       3705
9               1902        86         8             0               0                 0       1996
10              1146        53        13             0               0                 0       1212
11               750        41        13             0               0                 0        804
12               542        16         4             0               0                 0        562
13               397        21         1             0               0                 0        419
14               284        12         0             0               0                 0        296
15               213        16         0             0               0                 0        229
Total         163765     32940      3500           201              13                 2     200421

Table 6: Frequency distribution of the French candidate sequences after the elimination of unique-context sub-sequences. Each row details the distribution of sequences of the indicated length.
lng. \ frq.      2–9     10–99   100–999   1,000–9,999   10,000–99,999   100,000–999,999      Total
1               5183      3476      1038            78               1                 0       9776
2              27688      9902      1087            39               1                 0      38717
3              41707      9302       690            21               0                 0      51720
4              35612      4965       234             6               0                 0      40817
5              22405      2204        68             2               0                 0      24679
6              12560       958        17             1               0                 0      13536
7               6855       436        11             0               0                 0       7302
8               3528       170         5             0               0                 0       3703
9               1901        86         8             0               0                 0       1995
10              1146        53        13             0               0                 0       1212
11               750        41        13             0               0                 0        804
12               542        16         4             0               0                 0        562
13               397        21         1             0               0                 0        419
14               284        12         0             0               0                 0        296
15               213        16         0             0               0                 0        229
Total         160771     31658      3189           147               2                 0     195767

Table 7: Frequency distribution of the French candidate sequences after both unique-context sub-sequences and stop-word sequences have been eliminated. Each row details the distribution of sequences of the indicated length.
4.3 Dictionary evaluation
The ideal set of parameters used in complex statistical processes, such as seq_align's dictionary induction process, depends highly on the processed text pair and thus is almost impossible to extrapolate. Therefore, this setting is always determined based on previous empirical experience, leaving a certain “safety range” so as to take into account unprecedented or extraordinary deviations.
The evaluated dictionary was produced using the following parameter setting (see above in Sections 2.4 and 3.1, where the EM process is discussed in detail; a schematic representation of this setting is sketched after the list):

•  Lmax (maximal sequence length) = 15 (see footnote 9).

•  w (the width of the allowed window for source-text candidates) = 15: determined according to the relatively high reliability of the initial rough alignment (see above in Section 4.2).

•  Iteration filtering threshold = 0.01 (only very unreliable connections were eliminated).

•  Number of iterations = 10: shown to be sufficient for the convergence of the model.

•  Final filtering threshold = 0.1: relatively reliable translations.

•  Maximal frequency ratio = 100: this is a very high value. It was chosen because the sample of the ARCADE project had concentrated on multi-contextual words having many different and sometimes rare translations. This parameter setting resulted in a certain level of noise.
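Purely as an illustration, the setting listed above could be bundled as follows; the field names are hypothetical and do not correspond to the actual seq_align interface.

    # Hypothetical representation of the parameter setting used for the evaluated dictionary.
    from dataclasses import dataclass

    @dataclass
    class InductionParams:
        max_seq_len: int = 15               # Lmax, maximal candidate sequence length
        window_width: int = 15              # w, window for source-text candidates
        iteration_threshold: float = 0.01   # per-iteration filtering of unreliable connections
        num_iterations: int = 10            # EM iterations (sufficient for convergence)
        final_threshold: float = 0.1        # keep relatively reliable translations only
        max_freq_ratio: float = 100.0       # maximal allowed frequency ratio

    params = InductionParams()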
The evaluation presented below was done after the dictionary had been refined by the noise-cleaning algorithm described in Subsection 3.3.1. As explained above (in Section 4.1), two aspects of the dictionary's quality were evaluated: (a) coverage rate, and (b) quality of existing entries. Both assessments were based on a random sampling of sequences. As the coverage evaluation is intended to estimate the chance of finding a dictionary entry for a given term, the reference list for this task was picked from the final list of English candidates (induced from the text as described in Section 3.2). In contrast, the reference list for the qualitative entry evaluation was extracted from the list of dictionary entries.
The candidate lists for both evaluations were restricted to sequences likely to
be searched by a human translator. More specifically, sequences beginning or ending
in stop-words or numbers were excluded because English terms are rarely of this
structure. In addition, sequences occurring within the text fewer than 5 times were filtered out since they are less likely to be specialised terms. The random samples taken
from these partial lists were finally cleaned of non-term sequences by a human judge.
The coverage sample consisted of 85 terms, of which 78 (91.8%) existed as
dictionary entries. Table 8 presents the entire entry-quality sample, containing 83 English terms (selected by a human judge independently of the first sample) along with
their suggested translations. Table 9 summarises the results of the evaluation of this
sample.
The First, Second and Third categories refer to terms whose full and exact
translations were found at the first, second and third places within the suggested translation list, respectively. The Split category relates to those terms whose translation is
given completely, but is broken into more than one target sequence in the dictionary.
For instance, the English term research centre should be translated into French as
centre de recherche. However, the dictionary offers the separate sequences le centre
and de recherche.
English source term                               French translations
joint answer to write question (149)              réponse commune à le question (0.8217, 145)
former yugoslavia (98)                            le yougoslavie (1, 100)
parliament 's secretariat (66)                    de le parlement européen (0.442128, 426), et à le secrétariat (0.143291, 87)
euratom treaty (57)                               le traité euratom (0.766732, 57), le article (0.110678, 1433)
invitation to tender (46)                         appel de offre (1, 92)
infringement procedure (43)                       procédure de (0.546413, 390), infraction (0.453586, 249)
eastern european country (40)                     le pays de (0.966365, 409)
social rights (30)                                le droit (0.913762, 1565)
prime minister (25)                               premier ministre (0.624228, 25), le premier (0.375772, 321)
objective 1 region (24)                           de le objectif № 1 (1, 34)
implementation of directive (23)                  application de le (0.797454, 603), de le directive (0.202547, 973)
community support frameworks (22)                 cadre communautaire de appui (0.979087, 102)
respect of human right (22)                       de le droit de (0.3009525, 557), le droit de le (0.2594965, 671), droit de le homme (0.212904, 555)
research centre (21)                              de recherche (0.58948, 336), le centre (0.364115, 223)
high education diploma (20)                       le diplôme de (0.548492, 38), de enseignement supérieur (0.42465, 37)
infringement of community law (19)                le droit communautaire (0.644084, 251), infraction (0.350831, 249)
veterinary medicine (19)                          le médicament vétérinaire (1, 20)
geneva convention (18)                            convention de genève (1.000001, 16)
share the view (18)                               partager (1, 148)
nuclear installation (17)                         le installation nucléaire (0.617818, 32), de le installation (0.32276, 96)
application of directive (16)                     application de le directive (0.72932, 113), le application (0.27068, 508)
fishery resource (16)                             le ressource (1, 236)
community participation (15)                      participation de le communauté (1, 18)
selection criterion (15)                          de sélection (0.601411, 33), critère de (0.278096, 109), le critère (0.120494, 216)
flora and fauna (14)                              le faune (0.534696, 71), le flore (0.465304, 54)
small and medium size enterprise (14)             petit et moyen entreprise (0.926607, 72)
commission report (13)                            rapport (1, 1108)
northern ireland (13)                             irlande de le nord (1, 12)
secretariat of the european parliament (13)       secrétariat général de le parlement (0.5189315, 122), le secrétariat général de le (0.1905355, 139)
harmful substance (12)                            substance (1, 259)
cotton producer (11)                              le producteur de coton (0.7178635, 7), de le producteur (0.1715235, 79)
financial regulation (11)                         le règlement financier (0.86068, 11)
regional and local body (11)                      le collectivité régional et local (0.853599, 17)
toy safety (11)                                   sécurité de le jouet (0.815052, 22), le sécurité (0.184948, 400)
german law (10)                                   de le loi (0.4649225, 118), le loi allemand (0.4608975, 9)
legal instrument (10)                             juridique (1, 281)
technical progress (10)                           le progrès technique (0.913113, 10)
copyright protection (9)                          protection de le (0.27091, 693), de le droit de (0.25928, 557), le protection (0.146291, 612), de auteur (0.118302, 48)
emilia romagna (9)                                romagne (0.80016, 9), le zone (0.19984, 472)
forest area (9)                                   forêt (1, 283)
selection board (9)                               de le jury (0.913342, 14)
air traffic (8)                                   trafic aérien (1.000001, 7)
civil liability (8)                               responsabilité civil (0.860281, 11), le responsabilité (0.139719, 158)
heat treatment (8)                                traitement thermique (0.355835, 10), le bois (0.206769, 97), bois de (0.186439, 23), minute (0.135507, 9)
positive action programme (8)                     en faveur de (0.3736175, 526), programme de action positif (0.213217, 6), faveur de le (0.1868085, 433)
solitary confinement (8)                          isolement (1, 15)
trade union right (8)                             syndical de le (0.426879, 6), droit syndical (0.345815, 10), agent de (0.18851, 28)
academic staff (7)                                recrutement de (0.3120455, 29), de recrutement (0.2893015, 22), enseignant (0.154137, 40)
car tax (7)                                       taxe automobile (0.893604, 2), local № (0.106396, 2)
community import (7)                              le importation communautaire (1, 14)
non technical summary (7)                         résumé non technique (0.715219, 7), le article 5.2. (0.173798, 2)
radio station (7)                                 radio commercial (0.972542, 4)
school nurse (7)                                  infirmier scolaire (0.598184, 6), de infirmier (0.12156, 9)
stainless steel (7)                               en acier (0.452073, 12), tube (0.263171, 11)
available information (6)                         disponible (0.569471, 241), tronçon (0.430529, 13)
committee of inquiry (6)                          commission de enquête (0.52114, 8), enquête de le (0.322032, 18), de le parlement (0.102073, 564)
european bank (6)                                 européen pour (0.26379, 130), banque européen (0.259424, 62), le reconstruction (0.254201, 27), le développement (0.222585, 584)
future cooperation (6)                            coopération futur (0.6936005, 4), leur futur (0.2644605, 2)
hygiene directive (6)                             sur le hygiène (0.524822, 3), de directive (0.236108, 225), proposition de (0.177081, 520)
insurance organization (6)                        grec de (0.411557, 97), oga (0.180682, 3), organisme (0.176888, 359)
legislative measure (6)                           le mesure législatif (0.967564, 13)
medical assistance (6)                            assistance médical (0.7736, 7), médical en (0.201264, 5)
national congress (6)                             le congrès national (1, 6)
purification system (6)                           domestique de (0.414207, 6), de épuration (0.363314, 39), appareil (0.169361, 48)
social situation (6)                              situation économique et (0.643715, 5), de patras (0.341051, 46)
aid budget (5)                                    de aide (0.37485, 356), le véhicule à moteur et de (0.125203, 4), sur le véhicule à moteur (0.124907, 4), et de le taxe (0.123036, 2)
competition for recruitment (5)                   condition de attribution (0.263102, 3), de monsieur virginio bettini (0.263102, 9), dernier concours (0.178227, 3), attribution et (0.131551, 4), périphérique (0.131551, 39)
dental practitioner (5)                           art dentaire (0.445513, 13), praticien (0.367018, 7), le art (0.180626, 16)
ec information (5)                                juridique de (0.257913, 59), protection juridique (0.252701, 12), consommateur et (0.178553, 25), et worldcom (0.169339, 3), information sur (0.141494, 173)
energy issue (5)                                  à le énergie (0.527547, 17), de le question (0.176263, 215)
european network of high speed train (5)          réseau européen de train à grand vitesse (0.330116, 9), publique (0.131728, 401), rendre (0.131728, 337)
freedom of movement of person (5)                 libre circulation de le personne (0.87184, 31)
health authority (5)                              le autorité sanitaire (0.818522, 4), le santé (0.181478, 419)
illegal discharge (5)                             illicite en (0.650023, 4), rejet (0.349976, 96)
language capability (5)                           le compétence linguistique (0.871604, 4)
local tax (5)                                     taxe (0.599337, 314), frapper le produit (0.400663, 4)
meeting of expert (5)                             le csce et (0.264953, 7), humain de le (0.253833, 5), de le expert (0.143893, 44), et le rapport (0.143893, 14), le réunion (0.100291, 169)
pilot plant (5)                                   pilote (0.576723, 42), que à le moins (0.377826, 6)
private individual (5)                            un particulier (1, 6)
radiation protection research action (5)          recherche dans le domaine de le (0.692034, 19)
rehabilitation centre (5)                         de rééducation (0.438528, 5), construction de (0.292352, 234), un centre (0.142234, 55)
support activity (5)                              soutien (1, 350)
union of agricultural cooperative (5)             le union de coopérative agricole (0.999999, 8)

Table 8: Sample of entry's quality. The integers in parentheses indicate the sequences' frequency within the text. The real numbers in the French translations column represent the probabilities of the translations as computed by the seq_align algorithm and re-estimated by the noise-cleaning algorithm.
Category     Count    %
First        42       50.6
Second       5        6.0
Third        1        1.2
Split        17       20.5
Partial      17       20.5
Erroneous    1        1.2

Table 9: Entry's quality evaluation
At first glance, this phenomenon could be perceived as a weakness of the seq_align algorithm because it does not indicate the contiguous translation. Nevertheless, looking within the text reveals that most, if not all, of these target terms are actually split. For example, the term national research centre is translated into centre national de recherche, whereas community research centre yields centre communautaire de recherche. This split display of the translation not only tells the translator that the translation might be broken, but also points to the exact location where this break normally happens.
The Partial category refers to those terms which were not given the entire translation (either contiguous or split), but are assigned principal parts of the expected target expressions. Though these entries do not supply the translator with all of the needed information, the bilingual concordance (based on the detailed alignment) can help find the missing pieces.
In fact, the first four categories deal with entries which give full and precise information. In other words, approximately 79% of the entries provide the translator with all of the necessary knowledge. Adding the 20.5% of partial hits, which are valuable as
well, it can be concluded that almost 100% of the dictionary’s entries can be useful
for a human translator.
4.4 Alignment evaluation
This section describes two different evaluations. The first relates to the full reference list of the ARCADE project, whereas the second focuses on multi-word terminology from the viewpoint of a human translator. The term-wise evaluation was done on the same dictionary used for the evaluation of Section 4.3, which had been cleaned by the noise-cleaning algorithm. The ARCADE results presented here relate to the uncleaned dictionary, improved only by the simple stop-word elimination algorithm (see Section 3.3). Experiments showed that although this dictionary is of much lower quality for a human translator, a certain amount of useful information is lost during the cleaning, which lowers the score on the ARCADE sample.
As mentioned in Section 2.4, a minimal probability threshold T is pre-defined
to avoid noisy connections. Experiments showed that a rather reasonable value for this
parameter is 0.005, though slight changes do not make a large difference.
4.4.1 The ARCADE evaluation
The ARCADE project (Véronis & Langlais, 2000) was intended to organise a comparative evaluation of different systems for text alignment. One of the project's tracks was dedicated to word and expression alignment.
The reference list for the word track was prepared as follows: 60 French words
were chosen—20 adjectives, 20 nouns and 20 verbs. Each of these words appears
within many different contexts, including multi-word idiomatic expressions. About 60
occurrences of each distinct word across the JOC corpus were marked up. Then, two
human annotators were asked to mark the entire French expression within which each
of the words appeared and then the English counterpart of that expression.
That way, a set of 3723 French/English word/expression pairs was created.
The annotation of each human judge was preserved even when there was a disagreement on either the French or English unit. In such cases, the evaluation procedure was
instructed to take the better of the two grades. It should be noted that some of the marked
expressions were split (non-contiguous).
Naturally, in some cases no English correspondence existed, so the annotators had to leave the English column blank. The precision and recall for that event
were defined as 1 for a blank answer and 0 otherwise.
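The per-item scoring just described can be sketched as follows (illustrative Python; the helper names are ours and the word-level scorer is simplified to a set intersection):

    # Hedged sketch of the ARCADE-style per-item scoring: the system answer is scored
    # against each judge's annotation and the better grade is kept; a blank reference
    # scores 1 only when the system answer is blank as well.
    def prf(suggested, reference):
        correct = len(set(suggested) & set(reference))
        p = correct / len(suggested) if suggested else 0.0
        r = correct / len(reference) if reference else 0.0
        f = 2 * p * r / (p + r) if correct else 0.0
        return p, r, f

    def score_item(system_words, annotations):
        """annotations: one reference word list per judge; an empty list marks a blank answer."""
        best = (0.0, 0.0, 0.0)
        for reference in annotations:
            if not reference:
                grades = (1.0, 1.0, 1.0) if not system_words else (0.0, 0.0, 0.0)
            else:
                grades = prf(system_words, reference)
            best = max(best, grades, key=lambda g: g[2])    # keep the better grade
        return best

    # The judges disagreed; the grade against the better-matching annotation is kept:
    print(score_item(["side", "effects"], [["side", "effects"], ["effects"]]))   # (1.0, 1.0, 1.0)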
The task set for the systems participating in the ARCADE competition was
equivalent to that given to the human annotators, i.e. to (automatically) identify the
French expression which possibly encapsulates the reference word and to indicate its
translation within the English text. The publicised results, however, relate only to the
latter part, that of finding the correct translations.
Five research groups responded to the ARCADE challenge and sent their system results for the above-mentioned reference list. The best results were achieved by
the system of the Xerox Research Centre Europe (XRCE), Grenoble, France, developed by Éric Gaussier and David Hull (as for their method, see above in Sections 2.2
and 2.3). The other participants were the Linköping Word Aligner (Ahrenberg, Andersson & Merkel, 2000), and the systems developed by the RALI group (University
of Montreal, Canada), the CEA group (Gif-sur-Yvette, France) and the LILLA group
(Nice, France). The alignment methods applied by the latter three systems have never
been published. Table 10 details the achievements of each system in each of the three
grammatical categories—adjectives (A), nouns (N) and verbs (V)—as well as the
overall averages. The table also gives the same kind of data regarding the original
word_align algorithm and its seq_align extension. The four systems that did not win the competition are labelled S1…S4, as done in (Véronis & Langlais, 2000), in order not to embarrass any participant.10
As indicated by the table, both word_align and seq_align have achieved the
same overall F percentage. Nonetheless, there are some differences between these two
algorithms in terms of precision and recall. It can be said that word_align is more precise, while seq_align has a better coverage rate.
In comparison with the other participants, both algorithms take the fourth
place in the list (between S2 and S3). The detailed results indicate that both
word_align and seq_align have difficulties in aligning cases of inconsistent translation, which are not infrequent in the JOC corpus. As mentioned above (in Section
2.2), the winning XEROX system uses Hiemstra's bi-directional model. Some authors had previously argued that such a model is more robust, but their claim had not been proven quantitatively. As most reference connections were of the one-to-one type, the large difference between the results of our algorithms and those of the XEROX system is most plausibly a consequence of the underlying mathematical models. This suggests that Hiemstra's model is better suited to this task than IBM's Model 2.
10 The full data have been kindly provided to us by Mr. Véronis. The accompanying analysis is based on these data.
Cat   Entries   System   Prec   Rec    F
A     1167      S1       0.43   0.42   0.43
                S2       0.31   0.31   0.31
                S3       0.63   0.63   0.63
                S4       0.63   0.77   0.66
                XEROX    0.84   0.84   0.84
                WA       0.61   0.59   0.60
                SA       0.58   0.63   0.61
N     1055      S1       0.31   0.30   0.30
                S2       0.22   0.21   0.21
                S3       0.70   0.68   0.68
                S4       0.61   0.76   0.65
                XEROX    0.78   0.76   0.76
                WA       0.62   0.59   0.60
                SA       0.53   0.57   0.55
V     1501      S1       0.21   0.20   0.21
                S2       0.08   0.08   0.08
                S3       0.47   0.42   0.44
                S4       0.58   0.67   0.58
                XEROX    0.72   0.62   0.65
                WA       0.46   0.43   0.45
                SA       0.46   0.48   0.47
All   3723      S1       0.31   0.30   0.30
                S2       0.19   0.19   0.19
                S3       0.58   0.56   0.57
                S4       0.60   0.73   0.63
                XEROX    0.77   0.73   0.74
                WA       0.55   0.53   0.54
                SA       0.52   0.55   0.54

Table 10: The ARCADE evaluation data together with those of word_align (WA) and seq_align (SA)
As already mentioned above (in Section 2.3), the Xerox method uses language-specific syntactic knowledge to identify MWU candidates (as done by most
alignment methods). However, it seems that the principles of the extension of
word_align to seq_align, i.e. estimating the model’s parameters for all valid sequences while considering length relations, are applicable to Hiemstra’s model as well. Extending Hiemstra’s algorithm that way can yield a new robust parsing-independent
algorithm for single- and multi-word alignment.
4.4.2 Evaluation of word_align vs. seq_align
As the main goal of this research was to extend the word_align algorithm so that it
could handle multi-word units without using any syntactic parsing, a specific comparison between the two algorithms in this respect is obviously required. As established above, the interest in MWUs is primarily in the alignment of specialised terminology. Due to time and manpower constraints, it was not feasible to generate a representative sample of connections, one that would test the performance of the two algorithms on terms of various frequencies. Therefore, we had to make do with a less than ideal sample, which we derived from the ARCADE sample by manually selecting connections where the French part was a multi-word term.
Most of the terms in the resulting sample are of very low frequency (below 5 occurrences), a property which is not characteristic of principal terms in a domain-specific corpus. In addition, some of the terms were translated inconsistently (as seen in the table below), which is also atypical of terminology. Nevertheless, the results for this sample reflect, to some extent, the relative levels of the two methods.
The Termight method (Dagan & Church, 1994, 1997; see above in Section
2.3) is a simple way to extend single-word to multi-word alignment, without parsing
both texts. However, monolingual parsing is applied in order to identify the boundaries of the terms in one of the languages, which then makes it possible to build a glossary and a corresponding concordance. In machine-aided human translation (as described above in Section 4.1), the translator usually looks up the translation of a specific term in the source language, so it is unnecessary for the system to identify the boundaries of that term fully automatically. Nonetheless, if parsing is not available, it is impossible to prepare a glossary of reasonable size in advance.
In order to fairly compare word_align with seq_align, we decided to try them
in the Termight context, where word_align is expected to yield its best results. Hence,
the English counterparts were defined as the concatenation of all words residing between the leftmost and rightmost alignments of the given French words. As mentioned
above (in Subsection 4.4.1), the original ARCADE reference list indicated only one of
the expression’s words, expecting the annotator/system to identify the rest. However,
since we also had the human annotators’ results, we could replace the single
words with the entire expressions as determined by the judges.
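The Termight-style definition of the English counterpart used here can be sketched as follows (illustrative names and toy data; word_align's actual output format is not reproduced):

    # Sketch: the English counterpart of a French term is the stretch of English words
    # lying between the leftmost and rightmost alignment points of the term's words.
    def english_counterpart(french_positions, alignment, english_words):
        """french_positions: indices of the term's words in the French text;
        alignment: French word index -> English word index (unaligned words are absent)."""
        targets = [alignment[i] for i in french_positions if i in alignment]
        if not targets:
            return []                                   # nothing aligned at all
        left, right = min(targets), max(targets)
        return english_words[left:right + 1]

    # Toy example with made-up indices:
    alignment = {10: 4, 11: 6}                          # "haute" -> 4, "tension" -> 6
    english = ["the", "new", "220", "kV", "high", "voltage", "lines", "were", "built"]
    print(english_counterpart([10, 11], alignment, english))    # ['high', 'voltage', 'lines']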
For simplicity, we took only connections where there was a full agreement between the two annotators on both French and English parts. The sample and the corresponding results are presented in Table 11. The average precision, recall and F rates
are shown in Table 12.
The results resemble the ARCADE results both numerically and in the trend they indicate (i.e. word_align's advantage in precision and seq_align's advantage in recall). A closer look at the detailed results reveals that a significant part of seq_align's precision problem is related to its tendency to accompany the translation with some surrounding words. In most cases, these additional words are those which were typically found around the translating unit. It is to be expected that if an expression appears in many different contexts and is translated consistently, the level of noise produced by the algorithm will decrease (as happens for the more frequent terms such as compagnie aérienne, économie d’énergie etc.).
It should be noted that the Termight method cannot be fully successful in handling cases of one-to-many alignments, because word_align's output suggests only a single source word for each target word. Nonetheless, many-to-one and many-to-many alignments are workable for both algorithms, though seq_align is expected to do better in cases of non-literal translations.
In light of the similar quality of the alignments generated by the two algorithms, it is important to stress that the main advantage of seq_align over word_align
is related to the dictionary. While word_align’s dictionary gives the translations of
single words only, seq_align’s dictionary is a single- and multi-word dictionary.
Word_align can provide a multi-word glossary only if syntactic parsing is applied to
the single-word alignment (as done by Termight (Dagan & Church, 1994, 1997)).
With a difference, seq_align manages to induce a high-quality bilingual glossary directly from the text, without using any syntactic knowledge.
Recall that a multi-word glossary, which supplies term translations to its user,
is also necessary for efficient retrieval of alignment examples. Hence, parsing-independent compilation of glossaries makes it possible to provide useful translation aids for
many pairs of languages even where reliable parsing is not available.
[Table 11 lists the full evaluation sample: for each French multi-word term selected from the ARCADE reference (e.g. lignes à haute tension, haute performance, télévision à haute définition, temps plein, enseignement secondaire, chefs d'entreprise, compagnie aérienne, économies d'énergie, marche à pied), it gives the reference English translation, the English counterpart suggested by word_align with its precision and recall, and the English counterpart suggested by seq_align with its precision and recall.]

Table 11: Evaluation sample of term alignment by word_align and seq_align
Algorithm     Prec   Rec    F
word_align    0.59   0.54   0.56
seq_align     0.51   0.60   0.55

Table 12: Average quality of term alignment by word_align and seq_align
5 CONCLUSIONS AND FUTURE WORK
One of the problematic issues in bilingual terminology extraction and detailed alignment has been the identification of term boundaries, which is a pre-condition for the
compilation of a term-level bilingual glossary and a corresponding concordance. Most
authors have solved the problem by applying language-specific monolingual syntactic
analysis to at least one of the text halves. This approach has yielded very impressive
results, but could not provide a generic, language-independent solution for the problem.
The seq_align algorithm, presented in the current dissertation, was initiated in order to supply such a general solution, especially for cases where high-quality parsing is not available. For this purpose, we took the well-known word_align algorithm (Dagan, Church & Gale, 1993) as a basic model to be developed into a syntax-independent algorithm for the treatment of multi-word units (MWUs).
Unlike word_align, which requires monolingual syntactic analysis in order to
compile a bilingual multi-word glossary (as done by Termight (Dagan & Church,
1994, 1997), the seq_align method does not make any assumptions about the availability of such knowledge. As the evaluation of the bilingual dictionary shows, a useful glossary can be induced regardless of syntactic considerations. When applied to the EM process's output, the noise-cleaning algorithm yields a comprehensive and precise glossary of the principal terms which appear in the given text, using statistical data only. This glossary also indicates the exact location of potential breaks in the target compounds, which is very helpful information for a translator.
Both word_align and seq_align have achieved detailed alignments of reasonable quality. Though they were assigned approximately the same average grade, it turned out that word_align has a slight advantage in terms of precision, whereas seq_align has the advantage in terms of recall. It should be noted that the aggregate equivalence of the two algorithms was reached despite the much greater number of candidates considered by seq_align, which could have had a serious effect on
the quality of its output. Recall, however, that this achievement is strongly related to
the removal of affix stop words from the source entries of the inverted dictionary (see
Subsection 3.3.2).
The results of the comparison of word_align and seq_align with other alignment systems on the basis of the ARCADE project data suggest the superiority of Hiemstra’s model (Hiemstra, 1996) used by the XEROX system, over IBM’s Model 2,
used by our algorithms. Seemingly, this supports the claim raised by several authors
that a non-directional model represents the relations between parallel texts better than
a directional one.
In any case, the basic idea of seq_align, i.e. estimating the model’s parameters
for all valid sequences while considering length relations, is applicable to Hiemstra’s
model too. Such an extension of Hiemstra’s method is likely to provide reliable MWU
alignment for many language pairs without using any syntactic knowledge.
REFERENCES
Ahrenberg, L., Andersson, M. & Merkel, M. (1998). A Simple Hybrid Aligner for Generating Lexical Correspondences in Parallel Texts. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montréal, Canada, 10–14 August 1998, 9-35.
Ahrenberg, L., Andersson, M. & Merkel, M. (2000). A knowledge-lite approach to
word alignment. Parallel Text Processing (Véronis, J., Ed.), 97–116. Dordrecht, Kluwer Academic Publishers.
Baum, L. E. (1972). An inequality and an associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities,
3, 1-8.
Brown, P. F., Della Pietra, S., Della Pietra, V. J. & Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational
Linguistics, 19 (2), 263–311.
Brown, P. F., Lai, J. C. & Mercer, R. L. (1991). Aligning Sentences in Parallel Corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics. Berkeley, 169–176.
Bourigault, D. (1992). Surface grammatical analysis for the extraction of terminological noun phrases. Proceedings of the 14th International Conference on Computational Linguistics (COLING’92), Nantes, France, 977–981.
Choueka, Y., Conley, E. S. & Dagan, I. (2000). A comprehensive bilingual word
alignment system : application to disparate languages: Hebrew and English.
Parallel Text Processing (Véronis, J., Ed.), 69–96. Dordrecht, Kluwer Academic Publishers.
Church, K. W. & Gale, W. A. (1991). Concordances for Parallel Text. In Using Corpora: Proceedings of the Eighth Annual Conference of the UW Centre for the
New OED and Text Research (Oxford, September 29 – October 1, 1991), 40–
62.
Cover, T. M. & Thomas, J. A. (1991). Elements of Information Theory. New York:
John Wiley & Sons, Inc..
Dagan, I. & Church, K. W. (1994). Termight: Identifying and translating technical
terminology. Proceedings of the 4th Conference on Applied Natural Language
Processing, 34–40.
Dagan, I. & Church, K. W. (1997). Termight: Coordinating humans and machines in
bilingual terminology acquisition. Machine Translation, 12 (1–2), 89–107.
Dagan, I., Church, K. W. & Gale, W. A. (1993). Robust bilingual word alignment for
machine aided translation. Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, Ohio, 1–8.
Daille, B. (1994). Approche mixte pour l’extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. Unpublished doctoral dissertation, Université de Paris VII.
Daille, B., Gaussier, E. & Langé, J.-M. (1994). Towards automatic extraction of
monolingual and bilingual terminology. Proceedings of the 15th International
Conference on Computational Linguistics (COLING’94), Kyoto, Japan, 712–
716.
Debili, F. & Sammouda, E. (1992). Appariement des Phrases de Textes Bilingues.
Proceedings of the 14th International Conference on Computational Linguistics (COLING’92), Nantes, France, 517–538.
Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society,
39 (B), 1–38.
Fung, P. & Church, K. W. (1994). K-vec: A New Approach for Aligning Parallel
Texts. Proceedings of the 15th International Conference on Computational
Linguistics, Kyoto, 1096–1102.
Gale, W. A. & Church, K. W. (1993). A program for aligning sentences in bilingual
corpora. Computational Linguistics, 19 (3), 75–102.
Gaussier, É., Hull, D. & Aït-Mokhtar, S. (2000). Term alignment in use : Machine-aided human translation. Parallel Text Processing (Véronis, J., Ed.), 253–274.
Dordrecht, Kluwer Academic Publishers.
Haruno, M. & Yamazaki, T. (1996). High-performance bilingual text alignment using
statistical and dictionary information. Proceedings of the 34th Annual Meeting
of the Association for Computational Linguistics (ACL’96), Santa Cruz, California, 131–138.
Haruno, M. & Yamazaki, T. (1997). High-performance bilingual text alignment using
statistical and dictionary information. Journal of Natural Language Engineering, 3 (1), 1–14.
Hiemstra, D. (1996). Using Statistical Methods to Create a Bilingual Dictionary, Unpublished Master's thesis, Universiteit Twente.
Jacquemin, C. (1991). Transformation des noms composés. Unpublished doctoral
dissertation, Université de Paris VII.
Johansson, S., Ebeling, J. & Hofland, K. (1996). Coding and aligning the English-Norwegian parallel corpus. In Aijmer, K., Altenberg, B., Johansson, M. (Eds), Languages in Contrast. (Papers from a Symposium on Text-based Cross-linguistic Studies, 4–5 March 1994, pp. 85–112). Lund : Lund University Press.
Kay, M. & Röscheisen, M. (1988). Text-translation alignment. Technical Report.
Xerox Palo Alto Research Center.
Kay, M. & Röscheisen, M. (1993). Text-translation alignment. Computational Linguistics, 19 (1), 121-142.
McEnery, A. M., Langé, J.-M., Oakes, M. P. & Véronis, J. (1997). The exploitation
of multilingual annotated corpora for term extraction. Corpus Annotation:
Linguistic Information from Computer Text Corpora (Garside, R., Leech, G. &
McEnery, A. M., Eds.), 220–230. London, Addison Wesley Longman.
Melamed, I. D. (1996). Automatic detection of omissions in translations. Proceedings of the 16th International Conference on Computational Linguistics
(COLING’96), Copenhagen, 764–769.
Melamed, I. D. (1997a). Automatic discovery of non-compositional compounds in
parallel data. Proceedings of the 2nd Conference on Empirical Methods in
Natural Language Processing (EMNLP'97), Providence, RI, 7-108.
Melamed, I. D. (1997b). A word-to-word model of translational equivalence. Proceedings of the 35th Conference of the Association for Computational Linguistics (ACL'97), Madrid, 490-497.
Simard, M., Foster, G. F. & Isabelle, P. (1992). Using cognates to align sentences in
bilingual corpora. Proceedings of the Fourth International Conference on
Theoretical and Methodological Issues in Machine Translation (TMI), Montréal, Canada, 25–27 June 1992, 67–81.
Smadja, F. A. (1992). How to Compile a Bilingual Collocational Lexicon Automatically. Proceedings of the AAAI Workshop on Statistically-Based NLP Techniques.
Smadja, F. A. (1993). Retrieving collocations from text : Xtract. Computational Linguistics, 19 (1), 143–177.
Smadja, F. A., McKeown, K. R. & Hatzivassiloglou, V. (1996). Translation collocations for bilingual lexicons: a statistical approach. Computational Linguistics,
22 (1), 1–38.
Véronis, J. & Langlais, P. (2000). Evaluation of parallel text alignment systems : The
ARCADE project. Parallel Text Processing (Véronis, J., Ed.), 369–388.
Dordrecht, Kluwer Academic Publishers.