BAR-ILAN UNIVERSITY

Seq_align: A Parsing-Independent Bilingual Sequence Alignment Algorithm

EHUD S. CONLEY

Submitted in partial fulfilment of the requirements for the Master's Degree in the Department of Computer Science, Bar-Ilan University

Ramat-Gan, Israel
2002

ACKNOWLEDGEMENTS

In the first place, I would like to express my deep thanks to Dr. Ido Dagan, who advised and guided me towards the completion of this research. I am convinced that the quality of both the results and the dissertation itself owes much to his sharp and deep insight.

I would like to hereby acknowledge Prof. Jean Véronis of the University of Marseille, France, who provided me with the full data of the ARCADE project and has been ever ready to supply any helpful piece of information.

Many thanks as well to Mr. Michel Simard of the University of Montreal, Canada, and to Mr. Éric Gaussier and Mr. David Hull of the Xerox Research Centre Europe, Grenoble, France, for their kind and fruitful correspondence with me regarding their work. I would also like to acknowledge Prof. Achim Stein of the University of Stuttgart, Germany, for his aid with the issue of lemmatisation.

Finally, I would like to thank Prof. Shemuel (Tomi) Klein and Prof. Amihood Amir of the Department of Computer Science at Bar-Ilan University, who contributed their time, experience and patience to help me with certain important aspects of the algorithms' implementation. I am also grateful to my colleagues Zvika Marx and Yuval Krymolowski, who have always been glad to assist in any possible manner.

TABLE OF CONTENTS

ABSTRACT
1 INTRODUCTION
2 BACKGROUND AND RELATED WORK
  2.1 MACHINE-AIDED HUMAN TRANSLATION
  2.2 DICTIONARY INDUCTION AND WORD ALIGNMENT
  2.3 MULTI-WORD UNIT ALIGNMENT
  2.4 THE WORD_ALIGN ALGORITHM
3 THE SEQ_ALIGN ALGORITHM
  3.1 THE EXTENDED MODEL
  3.2 CANDIDATE SELECTION
  3.3 IMPROVEMENT OF THE DICTIONARY
4 RESULTS AND EVALUATION
  4.1 EVALUATION METHODOLOGY
  4.2 THE TEST CORPUS
  4.3 DICTIONARY EVALUATION
  4.4 ALIGNMENT EVALUATION
5 CONCLUSIONS AND FUTURE WORK
REFERENCES

ABSTRACT

This dissertation presents the seq_align algorithm, an extension of the word_align word-level alignment algorithm (Dagan, Church & Gale, 1993). The word_align algorithm tries to find the optimal alignment of single words across two parallel texts, i.e. a pair of texts where one is a translation of the other. The seq_align algorithm is intended to do the same for both single- and multi-word units (MWUs). Unlike other methods for MWU alignment, seq_align does not assume any syntactic knowledge. Rather, it uses statistical considerations as well as primitive lexical criteria to select the candidate word sequences.

The basis of the extension to multi-word sequences is the view of each text as the set of all word sequences existing within this text, up to a pre-specified maximal length. The probability that a target-language candidate is a translation of a given source-language candidate is measured through an iterative process similar to that performed by word_align, but this time all candidate sequences (rather than just single words) are judged. An additional set of length probabilities is integrated into the process so as to take into consideration the length relations between matching source- and target-language sequences. In practice, most candidate sequences are invalidated in advance according to a set of simple rules, using only short lists of the function words in each language. This cleanup phase enables handling a reasonable number of candidates as well as focusing on the more significant ones.

The output of the iterative process is a probabilistic bilingual dictionary of multi-word sequences, which can be used to align the parallel text at the sequence level. However, this dictionary is quite noisy, especially because of the inability of the statistical model to choose between sequences where one is a sub-sequence of the other. A heuristic noise-cleaning algorithm is suggested to overcome this obstacle. The existence of redundant affix function words is addressed through an additional elementary algorithm.

The experimental results are based on the data of the ARCADE project (Véronis & Langlais, 2000). The evaluation of these results shows that a syntax-independent algorithm can yield a highly reliable bilingual glossary. Another finding of the research is that both the word_align and the seq_align algorithms, based on a directional statistical model, cannot do as well as the Xerox method (Gaussier, Hull & Aït-Mokhtar, 2000), which is based on a non-directional model. However, it seems that the principles of the seq_align extension are applicable to a method of this kind as well.

1 INTRODUCTION

According to the Merriam-Webster online English dictionary, the term wind farm was first used in 1980. Imagine a translator at the end of 1980, trying to translate a new English document, dealing with wind-activated electrical generators, into the French language. Even the most comprehensive and up-to-date English/French dictionary, printed in the same year, could not provide him with the commonly accepted French parallel of wind farm.
It is not inconceivable that our imaginary translator, who was not a great expert in electricity, had furnished himself with at least one previously created English/French pair of parallel documents on a related topic. In that situation, he would begin to scan the English version of such a document in order to find an occurrence of the source term, wind farm. When he finally finds what he has been seeking, he has to read the parallel section of the French version in order to discover that a wind farm is termed station éolienne in French. He could never have guessed that.

Even if the parallel text is available electronically and is already aligned at the paragraph or sentence level, a repetition of this process for dozens of terms involves a high consumption of time and energy. Evidently, it would be much easier and more efficient if a glossary of the electricity domain were at hand. A collection of parallel text segments aligned at the term level and accessible through the glossary's entries could further help the translator in understanding the context in which each term is used. Within a computerised translation-aid system, clicking on the glossary entry wind farm and choosing station éolienne, the translator may be shown the following pair of text segments:

The Community's position vis-à-vis the Jandía wind-farm
Position de la Communauté européenne concernant la station éolienne de Jandía

Figure 1: Parallel text segments where the terms wind farm and station éolienne appear as mutual translations. The aligned terms are highlighted in bold font. The text is an excerpt from the JOC corpus of the Parliament of the European Union.

In order to be able to supply such resources for many domains while staying up to date, automatic induction tools are needed. Such tools have been developed since the early 1990s, based on various statistical measuring techniques. However, the recognition of the boundaries of multi-word terms has always been done using language-specific tools for syntactic parsing, which have knowledge about the typical structures of meaningful multi-word units (MWUs) in the language in question. Though these techniques have yielded quite good results, they are not easily portable to other languages for which syntactic parsing of sufficient quality is not available. Additionally, parsers sometimes do not include the definitions of rare structures, which necessitates the use of a pre-specified list of exceptions. Consequently, arriving at a point where a syntax-independent algorithm can provide results of a quality similar to that attained using parsing has remained a valuable objective.

The current dissertation presents the seq_align algorithm, a purely statistical method for MWU alignment, based on the prominent word_align algorithm (Dagan, Church & Gale, 1993). As an algorithm for word-level alignment, word_align per se cannot supply a MWU glossary. Hence, its output must always be processed using the output of a monolingual parser, as done in the Termight method (Dagan & Church, 1994, 1997). The new seq_align algorithm works on the sequence level, rather than the word level, testing single- and multi-word sequences together in order to identify their counterparts. Quite surprisingly, this novel method has managed to produce a rather high-quality glossary, having an average entry coverage rate of 90%, with above 71% of entries supplying a full translation.
Furthermore, almost all of the entries can be used to reach the full translation through a corresponding detailed sequence-level alignment.

The quality of the detailed alignments of both word_align and seq_align has been evaluated using the evaluation data of the ARCADE project (Véronis & Langlais, 2000). Both algorithms achieved the same average score of 54% (computed using the F-measure), though there have been slight differences in terms of precision and recall. This shared quality is lower than that achieved by some of the systems which originally participated in the ARCADE project, but is still at an acceptable level. The capabilities of the algorithms in term-level alignment have been tested on a subset of the ARCADE sample, indicating a similar quality.

The dissertation consists of four principal parts: Chapter 2 gives a survey of background and related work, including a full description of the original word_align algorithm. Chapter 3 describes the seq_align algorithm itself as well as the methods used for the selection of candidate sequences and for the improvement of the dictionary's quality. Chapter 4 presents the evaluation of the results, both for the glossary and for the detailed alignment. Finally, Chapter 5 discusses the conclusions of the research and points to potential future work.

2 BACKGROUND AND RELATED WORK

This chapter is divided into four sections. Section 2.1 is a general survey of machine-aided human translation. Section 2.2 is an overview of methods for automatic dictionary induction and detailed alignment. Section 2.3 discusses the problem of multi-word unit alignment and the different approaches towards its solution. Finally, Section 2.4 gives a succinct description of the word_align algorithm (Dagan, Church & Gale, 1993), which is the concrete basis of the currently proposed seq_align algorithm.

2.1 Machine-aided human translation

The task of translating a text from one language to another has been known to be problematic since ancient times. One of the major difficulties is finding the most suitable terminology for the context in question and using it properly. Specialised terms, commonly found in technical documents, should be translated not only suitably but also consistently. In the past, bilingual glossaries (i.e. collections of specialised terms with their translations) did not exist for most areas, not to mention organised collections of translation examples. Thus translators had to spend many hours reading and searching relevant texts in both source and target languages in order to extract the correct terms and study their proper usage.

In the last few decades, there has been a significant increase in the variety of domains of translated texts as well as in the number of language pairs dealt with. Additionally, the quantities of material to be translated have been growing at an accelerating pace. All these factors have motivated serious efforts to develop automatic tools capable of inducing bilingual glossaries and supplying relevant translation examples.

Imitating human learning, these tools try to deduce linguistic knowledge from bilingual parallel texts, i.e. pairs of texts where one is a translation of the other. For instance, given an English/French text discussing computer hardware, a glossary-induction tool is expected to find out that Random Access Memory, appearing in the English text, is translated in the French text as mémoire vive.
This is so even though random and access by themselves are normally translated as aléatoire and accès, respectively.

Given a pair of word sequences, a bilingual concordancing tool would display some parallel text segments (for instance, sentences or paragraphs) where these sequences appear and are likely to translate each other. The concordancer should also be able to indicate other correspondences between single- or multi-word units within each pair of aligned segments.

The task of identifying high-level segment correspondences is referred to as rough alignment, whereas that of mapping word-level connections is called word alignment or, more generally, detailed alignment. Figure 2 is an example of a partial detailed alignment.

During the past decade, significant efforts have been invested in the development of algorithms for automatic induction of bilingual glossaries as well as for both levels of alignment. Aside from its importance per se, rough alignment is required in order for the glossary induction and detailed alignment algorithms to function well. That is because these algorithms are all based on statistical methods which are very sensitive to slight deviations.1 Therefore, they need a suitable rough alignment to focus them on relatively short parallel text zones.

1 A linear alignment could have been used as a rough alignment, meaning that the expected parallel of the ith word of text S would be the jth word of text T such that j = (LT / LS) · i, where LS and LT denote the aggregate lengths of S and T, respectively. In fact, the real translations of words in parallel texts deviate from this diagonal. Alternatively, assuming a large search space to overcome these deviations significantly lowers the accuracy of detailed alignment algorithms.

Je suis convaincu que chacun de mes collègues présents aujourd'hui à
I believe that all of my colleagues presently sitting in

la Chambre des communes aimerait avoir l'occasion de proclamer haut
the House of Commons would like a chance to individually go on record

et fort, devant ses électeurs canadiens, sa fierté d'être
and officially tell their constituents that they are proud to be

Canadien.
Canadians.

Figure 2. An example of a partial English/French detailed alignment. The upper line in each line pair consists of the French text, while the lower line contains the English parallel. The arrows indicate the correspondences between the two texts. The text was excerpted from the Canadian Hansards bilingual corpus, which documents the debates of the Canadian parliament.

In many cases, texts are translated sentence by sentence, also keeping the original sentence order. Some parallel texts of that type even include sentence alignment mark-ups inserted during the translation process. When such a partition is not indicated explicitly, it can be obtained rather easily by applying one of the many accurate algorithms developed for automatic sentence alignment (for example, Kay & Röscheisen, 1988, 1993; Church & Gale, 1991; Gale & Church, 1993; Brown, Lai & Mercer, 1991; Debili & Sammouda, 1992; Simard, Foster & Isabelle, 1992; Haruno & Yamazaki, 1996, 1997; Johansson, Ebeling & Hofland, 1996). Sentence alignment is considered a high-quality initial rough alignment for the detailed alignment/glossary induction algorithms.

In some other cases, however, this level of parallelism does not exist, either due to the translator's preferences or because of the different natures of the two languages in terms of grammar and style.
In such cases, a rather satisfactory substitute for sentence alignment might be a set of highly reliable pairs of matching word occurrences, referred to as anchor points, which can be deduced from an unaligned parallel text (see for example Melamed, 1996, 2000; Fung & McKeown, 1997; Choueka, Conley & Dagan, 2000).

Indeed, glossary induction and detailed alignment algorithms are strongly correlated. On one hand, any detailed alignment algorithm is based on a suitable bilingual dictionary (see below). On the other hand, a bilingual glossary can be compiled rather simply using the local connections indicated by a detailed alignment output. Therefore, bilingual detailed alignment and glossary induction are considered highly dependent tasks.

2.2 Dictionary induction and word alignment

Naturally, a detailed alignment algorithm must utilise a bilingual dictionary corresponding to the aligned text pair in order to determine the most probable translation of each textual unit. As bilingual glossaries are rarely available, this information must be induced by the systems themselves. Different authors have proposed various techniques for the acquisition of these lexicographic data. These methods can be divided into two sorts: (a) single-pass measurements, and (b) iterative processes.

Single-pass measurement means applying certain statistically based measures to the counts of unit occurrence and co-occurrence within the text pair, obtained by scanning the text pair once. According to the acquired counts, a relative score is assigned to each pair of units (words or phrases), which corresponds to the likelihood that one unit is a valid translation of the other. The measures used by such methods include the well-known Mutual Information measure (Cover & Thomas, 1991), the Dice score (Smadja, 1992) and the T-score (Fung & Church, 1994). The Linköping Word Aligner (LWA) (Ahrenberg, Andersson & Merkel, 2000) is an example of an alignment system based on this type of measurement.
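To make this concrete, the following minimal sketch (ours, not taken from any of the cited systems; the function name and the counts are illustrative) computes one such single-pass measure, the Dice score, from counts obtained in a single scan of the text pair, where co-occurrence means appearing together in an aligned segment pair:

    def dice_score(freq_s, freq_t, freq_st):
        # Dice coefficient: twice the co-occurrence count, normalised by
        # the sum of the two units' individual occurrence counts.
        return 2.0 * freq_st / (freq_s + freq_t)

    # e.g. a source unit seen in 60 segments, a target unit in 55,
    # and both together in 50 aligned segment pairs:
    score = dice_score(60, 55, 50)   # = 0.869...

The closer the score is to 1, the more consistently the two units co-occur, and hence the more likely they are mutual translations.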
The iterative processes are actually specific bootstrapping algorithms, in which translational equivalence is measured by repetitively scanning the text pair, each time calculating rectified estimates using those attained at the end of the previous iteration. Each of these algorithms is based on a certain statistical model, which relates to the parallelism existing between the two parts of the bilingual text. Current algorithms are variants of one of the following basic models:

1. IBM's statistical models, suggested by Brown et al. (1993). The RALI system, which participated in the ARCADE project (see below in Subsection 4.4.1), used lexical data acquired through IBM's Model 1.2 A modification of Model 2, named word_align, was suggested by Dagan, Church & Gale (1993), and is the basis of the current research (see a detailed description of word_align in Section 2.4).

2. Hiemstra's model (Hiemstra, 1996), adopted by Melamed (1997b) and by Gaussier, Hull & Aït-Mokhtar (2000).

2 Personal communication.

The fundamental difference between these two model families is in the aspect of directionality. The IBM models associate unequal roles to the two halves of the bilingual text. One half, referred to as the target text, is assumed to have been generated from the other text, regarded as the source text. Each target-text word, t, is assumed to have been produced by at most one source-text word, s. This does not prevent s from being the origin of other target words. Hence, IBM models are considered directional models.

As opposed to these assumptions, Hiemstra's model relates to the two text halves symmetrically, enabling each word in one text to be aligned with at most one word in the other text. This non-directional assumption is also referred to as the one-to-one assumption. Cases of one-to-many (and many-to-many) alignments are treated only at the alignment phase (see below in Section 2.3).

Once the bilingual dictionary has been induced through one of these methods, a detailed alignment may be generated. All the algorithms mentioned above share the principle of choosing the best translation for each unit according to the acquired dictionary. Some of them first produce all possible alignments accompanied by their relative scores. Then, the best pairs are iteratively chosen while eliminating other pairs in which either the source or target unit appears (e.g. Gaussier, Hull & Aït-Mokhtar, 2000; Ahrenberg, Andersson & Merkel, 2000). Other methods simply connect the best source unit to each target unit (Brown et al., 1993; Dagan, Church & Gale, 1993), thus preserving the option of one-to-many alignment. Either way, a pre-defined threshold is always applied to filter out low-scoring, usually erroneous connections (i.e. pairs of aligned units). Additionally, certain positional probabilities are estimated and integrated within the score of each candidate connection in order to take into consideration also the relative location of the two units (as indicated by the initial rough alignment).

The translation of function words (i.e. words other than nouns, verbs, adjectives and adverbs) is generally inconsistent. Therefore, ignoring these units already at the dictionary induction phase is a common practice of most methods. Furthermore, low-frequency phenomena are sometimes more confusing than helpful for the process, and thus each method has its own criteria for their elimination.

As in any text processing, raw versions of parallel texts must be adjusted before the initiation of the bilingual process. The first pre-processing step, common to both monolingual and bilingual tasks, is tokenisation, i.e. separating punctuation marks and other symbols from the words with which they are concatenated. The resulting blank-delimited units are referred to as the text's tokens. The term words within the context of parallel text processing usually refers to the texts' tokens determined by this handling. Another important treatment is a certain level of monolingual morphological analysis through which different inflections of the same word are replaced by a common base form. Some systems apply a full morphological analysis or lemmatisation (see Choueka, Conley & Dagan, 2000), while others settle for heuristic stemming (e.g. Ahrenberg, Andersson & Merkel, 2000). These preliminary operations unify the various forms of basically identical words observed within the text, thus improving the algorithm's accuracy.
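As a minimal illustration of the tokenisation step (a sketch of ours; the actual pre-processing of this research relied on an external tokeniser-lemmatiser, see footnote 8 in Section 4.2), punctuation marks are split off the words to which they are concatenated:

    import re

    def tokenise(text):
        # A word is a maximal run of word characters; any other
        # non-space character (punctuation, symbols) becomes a token.
        return re.findall(r"\w+|[^\w\s]", text)

    tokens = tokenise("Position de la Communauté européenne (concernant l'aide).")
    # ['Position', 'de', 'la', 'Communauté', 'européenne', '(',
    #  'concernant', 'l', "'", 'aide', ')', '.']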
2.3 Multi-word unit alignment

The easiest case of detailed alignment is where each word of the source text is translated into exactly one word in the target text. A slightly harder case is where a few source words are not translated at all, whereas some target words have no specific "origin". In these two cases the one-to-one assumption is valid, hence there is no need to extend or modify any of the basic methods mentioned above (in Section 2.2) in order to accomplish fairly good results. Real-life texts, however, do not completely conform to this assumption. Sometimes a single word is translated into a few words or vice versa, and moreover, some source-language multi-word units (MWUs) are translated non-literally into target-language MWUs. Therefore, correct modelling of the problem is feasible only when the words of the non-literally translated MWUs are fused into an atomic unit.

A further complexity of the problem is that, not infrequently, a MWU is not contiguous within the text. Rather, additional words, such as adjectives, adverbs and pronouns, are often inserted between the basic words of an expression (e.g. centre national de recherche [national research centre]). Idiomatic expressions are sometimes even divided into several sub-units appearing at relatively distant parts of the sentence. The order of the words might also vary due to stylistic or grammatical considerations (e.g. conversion to the passive voice). This variety of potential modifications makes it very complicated to recognise the basic equivalence of different forms of one expression.

Trying to process all possible combinations of words, even within sentence boundaries, would increase the complexity of the problem to exponential in the length of the text (or sentence) while probably precluding any significant statistical data. Therefore, most researchers have concluded that it is essential to apply a certain level of syntactic analysis to the text in order to identify the most likely candidate MWUs and recognise their skeletons (principal words). This type of analysis is commonly considered a language-specific monolingual procedure. That is because the structures of expressions in each language are substantially different. Hence, a cross-language generalisation is apparently impossible.

As not every incidental phrase is likely to be an idiom, some methods combine the linguistic analysis with some statistical measures. The linguistic component in all methods is based on the output of shallow parsers, implemented through manually defined regular expressions or local grammars (Jacquemin, 1991; Bourigault, 1992; Smadja, 1993; Daille, 1994). This sort of technique was applied with some degree of success to bilingual alignment or glossary induction (e.g. Daille, Gaussier & Langé, 1994; Smadja, McKeown & Hatzivassiloglou, 1996; McEnery, Langé, Oakes & Véronis, 1997).

The MWU mark-ups can be integrated either into the dictionary induction or the detailed alignment phase. The former option might be interpreted in two manners: (a) when a candidate MWU is suggested, the whole MWU is considered while ignoring the constituent words, or (b) the candidate MWUs are considered in addition to the single-word tokens within the text. According to the first definition, if the translation of the fused MWU is compositional (literal), the statistics of the elemental words or sub-units are distorted. The second definition, however, suggests that the decision whether a MWU is translated literally or non-literally will be a consequence of the dictionary induction process. That is, the most probable translation for each token, either a single- or a multi-word unit, will be chosen. Nevertheless, judging additional candidates might lower the precision of the induction process. The latter approach was fully adopted by Ahrenberg, Andersson & Merkel (2000).

Melamed (1997a) suggested a specific bootstrapping algorithm for the recognition of non-compositional compounds within a parallel text. This algorithm is intended to solve the problem raised above regarding the former approach.
It applies the Mutual Information measure to compare different versions of the texts. Practically, this involves a somewhat full modelling of the text parallelism, which could perhaps have been exploited for the induction of a full dictionary.

As stated above, it is feasible to integrate the preliminarily detected MWUs only at the detailed alignment phase, meaning that the dictionary is induced solely for single words. The MWUs are aligned according to the equivalence of one or more pairs of elemental tokens, as determined by the single-word alignment algorithm. This technique is quite suitable for language pairs where most expressions are translated literally (i.e. where the one-to-one assumption is valid), as in English/French. Nonetheless, it cannot do very well in cases of non-literal translation, let alone when the languages tend to make frequent use of multi-word idioms (as in Chinese and Hebrew), which are rarely translated literally into other languages.

The Termight method, suggested by Dagan & Church (1994, 1997), utilises the output of the word_align algorithm to find translations of noun-phrase terms. The candidate source-language terms are proposed by a simple parser and then refined by a human judge. The suggested translations are simply the concatenation of the words residing between the leftmost and rightmost translations of the source phrase elements. Intermediate function words, if any, are therefore included automatically within the translation, as well as possibly other inserted words, which do not necessarily originate from the source phrase.

Gaussier, Hull & Aït-Mokhtar (2000) proposed to apply the parsing stage to both halves of the parallel text. Then, following the one-to-one alignment process, the probabilities of connections involving MWUs are estimated by multiplying the elements' connection probabilities. This type of connection is tried only if a certain syntactic constraint is satisfied.3 A similar method was used by the RALI system (see footnote 2) in the ARCADE project, but without imposing any syntactic constraint.

3 A connection involving a MWU is permitted only if the principal word of one unit, i.e. the head noun (for noun phrases) or the main verb (for verb chunks), is aligned with the principal word of the candidate translation.

By definition, methods based on monolingual parsing are language-specific and absolutely depend on the existence of a suitable tool for the language concerned. Building this sort of tool for some languages is a complicated issue due to their morphological richness, writing system or the like. Moreover, even if a basic tool has already been made available, some idioms have irregular structures that obligate its further development or even the preparation of a list of exceptions. Therefore, though exploiting monolingual parsing is expected to yield better results, there is still a genuine need for purely statistical, parsing-independent algorithms for detailed alignment.

The current research has taken up the modest challenge of extending one existing method for single-word alignment to handle MWUs without utilising any syntactic parsing. As a very preliminary and short-term research effort, it had to concentrate on a relatively restricted, feasible sub-task. The alignment of contiguous multi-word sequences seemed an objective of that kind. The number of sequences up to a pre-specified constant length is linear in the text's length, which ensures a non-exponential complexity.
Additionally, many idioms in many languages are contiguous.

The word_align algorithm (Dagan, Church & Gale, 1993) was taken as the core single-word alignment method. The basic idea of the presented extension is to exploit the "centrifugal" property of this method's dictionary induction phase. In other words, to take advantage of the process's nature of gradually augmenting the probabilities of the better translations while diminishing the rest. The suggested method is titled seq_align (following its principal origin, word_align).

The following section gives a concise survey of the word_align algorithm. The seq_align algorithm is introduced and discussed in detail in the rest of this dissertation.

2.4 The word_align algorithm

As mentioned above, the seq_align algorithm is an extension of the word_align algorithm (Dagan, Church & Gale, 1993). Similarly, the latter algorithm is a modification of a previous algorithm, IBM's Model 2, proposed a bit earlier by Brown et al. (1993). This section gives a survey of the word_align version of Model 2 in order to supply the reader with the appropriate background for the understanding of the seq_align extension.

The input of word_align is a parallel text, accompanied by a corresponding rough alignment (either a sentence alignment or a list of alignment anchor points). In an iterative manner, the algorithm induces a probabilistic bilingual dictionary which corresponds to the given text, as well as an additional set of estimated values (see below). Then, within another pass over the text pair, it tries to assign an optimal source word to each target word using the acquired probabilistic estimates.

The following notations shall be used in this dissertation for the description of the word_align and the seq_align algorithms:

S, T: the source and target texts, respectively.
si, tj: the source- and target-text words located at positions4 i and j, respectively.
I: the initial input rough alignment corresponding to the text pair <S,T>.
I(j): the source position corresponding to the target position j according to I.

4 Position: the location of a word relative to the beginning of the text.

As stated above (in Section 2.2), the word_align algorithm assumes a directional translation model. More specifically, each target-text token tj ∈ T is assumed to have been generated by exactly one source-text token si ∈ S ∪ {s0}, where s0 = NULL. The NULL token is considered the origin of target tokens that have no source parallel. Some source tokens, however, may be left unaligned. An alignment, a, is defined as a set of connections, where a connection <i,j> denotes that position i in the source text is aligned with position j in the target text.

The statistical model assumed by word_align describes the probability that T is the translation of S by the equation:

$$\Pr(T \mid S) = \sum_{a} \Pr(T, a \mid S) \qquad (1)$$

where a ranges over all possible alignments.

The probabilistic bilingual dictionary is generated through the Estimation-Maximisation (EM) technique (Baum, 1972; Dempster, Laird & Rubin, 1977), in accordance with the assumed probabilistic model.5 Given S, T and I as inputs, two sets of parameters are estimated:

5 For more theoretical background, see (Brown et al., 1993; Dagan, Church & Gale, 1993).

1. pT(t|s): Translation Probability: the probability that the target-language word t is the translation of the source-language word s.
2. pO(k): Offset Probability: the probability that an arbitrary token si, which is the real parallel of the token tj, is k words distant from sI(j), its expected parallel according to the initial alignment (k = i − I(j)).6

6 A positive value of k means that si is located after the position at which it was expected to be found according to the initial rough alignment. A negative value of k indicates that si appears before the expected position. A zero value occurs when si is located exactly where it was expected to be found.

Before the EM process is run, uniform values are given to all parameters. Then, in an iterative manner, the parameters are re-estimated until they converge to a local maximum or until a pre-specified number of iterations is reached.

In the first step of each iteration, every target token tj is examined independently as the possible translation of a set of source candidate tokens. These candidates are basically the tokens found within a distance of w words from sI(j), where w is a pre-specified windowing range. In fact, some filters are applied to both target and source candidates in order to improve the results.7 In addition, the allowed ratio between the frequencies of any connected source and target tokens is normally bounded as well. According to the probabilistic model, the probability of a connection <i,j>, which equals the sum of the probabilities of all alignments that contain this connection, is represented by

$$\Pr(\langle i,j \rangle) = \frac{W(\langle i,j \rangle)}{\sum_{i'} W(\langle i',j \rangle)} \qquad (2)$$

where

$$W(\langle i,j \rangle) = p_T(t_j \mid s_i) \cdot p_O(i - I(j))$$

and i′ ranges over all source positions (in the allowed window).

7 For instance, stop words can be eliminated from the candidate lists of both texts.

In the second step of each iteration, all pT and pO parameters are re-estimated using the Maximum Likelihood Estimate (MLE), given by the following equations:

$$p_T(t \mid s) = \frac{\sum_{\langle i,j \rangle :\, t_j = t,\, s_i = s} \Pr(\langle i,j \rangle)}{\sum_{\langle i,j \rangle :\, s_i = s} \Pr(\langle i,j \rangle)} \qquad (3)$$

$$p_O(k) = \frac{\sum_{\langle i,j \rangle :\, i - I(j) = k} \Pr(\langle i,j \rangle)}{\sum_{\langle i,j \rangle} \Pr(\langle i,j \rangle)} \qquad (4)$$

Both equations estimate the parameter values as relative "probabilistic" counts. The first estimate is the ratio between the probability sum for all connections aligning s with t and the probability sum for all connections aligning s with any word. The second estimate is the ratio between the probability sum for all connections with offset k and the probability sum for all possible connections.

By the end of each iteration, all pT(t|s) smaller than a certain threshold are set to 0. The new estimates of pT and pO are used in the next iteration to re-compute the local connection probabilities, Pr(<i,j>). At the end of the iterative process, pT(t|s) smaller than a certain final filtering threshold are set to 0, leaving only the most reliable <s,t> pairs. These pairs, together with their final probabilistic estimates, are considered the dictionary for the alignment phase.

The optimal word alignment of the text is found based on the dictionary and the final pO estimates. Once again, the local W(<i,j>) values are computed (the same way they were determined at the first step of each EM iteration), but this time, each target token is simply assigned the most probable source token within its permitted window. In order to avoid erroneous connections, a threshold Th is applied for each j, requiring that max(W(<i,j>)) ≥ Th, where i ranges over the window of j, thus leaving some target tokens unaligned.
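For concreteness, the following minimal sketch (ours, not the original implementation; it assumes sentence-aligned segment pairs and omits the offset probabilities, the windowing and frequency filters, and the NULL token) illustrates how the probabilistic counts of Equations (2) and (3) are accumulated in one EM pass:

    from collections import defaultdict

    def em_pass(segment_pairs, p_t):
        counts = defaultdict(float)   # numerator of Equation (3): c(s, t)
        totals = defaultdict(float)   # denominator of Equation (3): c(s)
        for src_seg, tgt_seg in segment_pairs:
            for t in tgt_seg:
                # Equation (2): normalise the weights of all source
                # candidates of t into local connection probabilities.
                weights = [p_t.get((s, t), 1e-6) for s in src_seg]
                z = sum(weights)
                for s, w in zip(src_seg, weights):
                    counts[(s, t)] += w / z
                    totals[s] += w / z
        # Maximum Likelihood re-estimates: p_T(t | s) = c(s, t) / c(s).
        return {(s, t): c / totals[s] for (s, t), c in counts.items()}

    # Toy usage on two pseudo-parallel segments:
    pairs = [(["wind", "farm"], ["station", "éolienne"]),
             (["wind"], ["vent"])]
    p_t = {}                          # empty start = uniform weights
    for _ in range(5):
        p_t = em_pass(pairs, p_t)

Repeated passes sharpen the estimates: translations supported by many segments accumulate probability mass, while accidental co-occurrences are diluted.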
As already established above, a few target tokens may be aligned with the same source token, but each target token may be connected with at most one source token.

3 THE SEQ_ALIGN ALGORITHM

3.1 The extended model

The seq_align algorithm is an extension of the word_align translation model. According to the extended model, each text is regarded as a set of token sequences, each of which is represented by its position and length (in tokens). Each of the two sequence sets includes all sequences of lengths [1,Lmax] that originate at each position throughout the text. Let S′ and T′ denote the source and target sequence sets, respectively. si,l and tj,m denote the source and target token sequences of lengths l and m that begin at positions i and j, respectively.

According to the new model, each target sequence tj,m ∈ T′ is assumed to have been generated by exactly one source sequence si,l ∈ S′ ∪ {s0,0}, where s0,0 = NULL. The NULL sequence is considered the origin of target sequences that have no source parallels. An alignment, a, is a set of connections, where a connection <i,l,j,m> denotes that the source sequence of length l beginning at position i is aligned with the target sequence of length m beginning at position j.

The probability of a connection <i,l,j,m> is represented by the following formula:

$$\Pr(\langle i,l,j,m \rangle) = \frac{W(\langle i,l,j,m \rangle)}{\sum_{i'} \sum_{l'=1}^{L_{max}} W(\langle i',l',j,m \rangle)} \qquad (5)$$

where

$$W(\langle i,l,j,m \rangle) = p_T(t_{j,m} \mid s_{i,l}) \cdot p_O(i - I(j)) \cdot p_L(m \mid l)$$

and i′ ranges over all source positions (in the allowed window). pL is an additional set of parameters which estimate the probability that a source sequence of length l is translated into a target sequence of length m.

The re-estimation of pT, pO and pL is performed in the same fashion as the two former sets are treated in word_align:

$$p_T(t \mid s) = \frac{\sum_{\langle i,l,j,m \rangle :\, s_{i,l} = s,\, t_{j,m} = t} \Pr(\langle i,l,j,m \rangle)}{\sum_{\langle i,l,j',m' \rangle :\, s_{i,l} = s,\, m' \in [1,L_{max}]} \Pr(\langle i,l,j',m' \rangle)} \qquad (6)$$

where j′ ranges over all target positions.

$$p_O(k) = \frac{\sum_{\langle i,l,j,m \rangle :\, i - I(j) = k} \Pr(\langle i,l,j,m \rangle)}{\sum_{\langle i,l,j,m \rangle} \Pr(\langle i,l,j,m \rangle)} \qquad (7)$$

$$p_L(m \mid l) = \frac{\sum_{\langle i,j \rangle} \Pr(\langle i,l,j,m \rangle)}{\sum_{\langle i,j \rangle,\, m' \in [1,L_{max}]} \Pr(\langle i,l,j,m' \rangle)} \qquad (8)$$

Resembling word_align, pT(t | s) is the ratio between the probability sum for all connections aligning s with t and the probability sum for all connections aligning s with any target sequence. pO(k) is the ratio between the probability sum for all connections with offset k and the probability sum for all possible connections. The additional length probability estimate, pL(m | l), is the ratio between the probability sum for all connections aligning source sequences of length l with target sequences of length m and the probability sum for all connections aligning source sequences of length l with target sequences of any length.

The integration of the length probability estimate into the EM process is based on the common assumption that the mutual nature of any two languages involves a certain level of consistency as to the lengths of corresponding sequences. The data presented in Table 1 verify this assumption. Notice that long English sequences tend to be translated into relatively short French sequences. The linguistic explanation for this phenomenon is that prefix and suffix English units are often translated into infix French units, thus splitting the basic expression into two short sequences (as demonstrated in Section 4.3).

The detailed alignment is generated using a technique similar to that of word_align, but adapted to handle multi-word units:

1. For each tj,m ∈ T′, choose si,l such that:

$$\langle i,l \rangle = \arg\max_{i,l} W(\langle i,l,j,m \rangle) \qquad (9)$$

where

$$W(\langle i,l,j,m \rangle) = p_T(t_{j,m} \mid s_{i,l}) \cdot p_O(i - I(j))$$

2. Eliminate all connections <i,l,j,m> for which W(<i,l,j,m>) < Th, where Th is a pre-specified threshold.

3. For each connection <i,l,j,m>, produce m copies, associating a number j … (j + m − 1) to each copy.

4. Sort the expanded connection list by the above numbering field.

5. For each target position (denoted by the number), select the connection for which W(<i,l,j,m>) is maximal.
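A compact sketch of these five steps (ours; the callables candidates and W and the data layout are illustrative assumptions, not the original implementation):

    def align_sequences(targets, candidates, W, th):
        # Steps 1-2: pick the best source candidate <i,l> for each target
        # sequence <j,m>; drop connections scoring below the threshold.
        connections = []
        for j, m in targets:
            cands = candidates(j, m)          # <i,l> pairs in the window
            if not cands:
                continue
            i, l = max(cands, key=lambda il: W(il[0], il[1], j, m))
            if W(i, l, j, m) >= th:
                connections.append((i, l, j, m))
        # Step 3: one copy of each connection per covered target position.
        expanded = [(pos, c) for c in connections
                    for pos in range(c[2], c[2] + c[3])]
        # Steps 4-5: sort by position; per position keep the best-scoring
        # connection, so overlapping sequences compete position by position.
        best = {}
        for pos, c in sorted(expanded, key=lambda x: x[0]):
            if pos not in best or W(*c) > W(*best[pos]):
                best[pos] = c
        return best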
l \ m    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
  1    0.94  0.05  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  2    0.19  0.59  0.15  0.04  0.01  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  3    0.08  0.29  0.42  0.14  0.05  0.02  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  4    0.05  0.20  0.32  0.26  0.09  0.04  0.02  0.01  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  5    0.04  0.16  0.25  0.25  0.18  0.08  0.02  0.01  0.01  0.00  0.00  0.00  0.00  0.00  0.00
  6    0.03  0.14  0.21  0.22  0.16  0.12  0.05  0.02  0.01  0.00  0.00  0.00  0.00  0.00  0.00
  7    0.03  0.14  0.21  0.19  0.17  0.10  0.08  0.03  0.02  0.01  0.01  0.00  0.00  0.00  0.00
  8    0.03  0.11  0.16  0.18  0.17  0.13  0.10  0.05  0.03  0.02  0.01  0.01  0.01  0.00  0.00
  9    0.03  0.10  0.12  0.14  0.09  0.07  0.06  0.08  0.11  0.15  0.02  0.01  0.01  0.00  0.00
 10    0.03  0.10  0.13  0.12  0.10  0.05  0.03  0.03  0.05  0.05  0.26  0.02  0.01  0.01  0.00
 11    0.03  0.10  0.16  0.09  0.12  0.09  0.06  0.11  0.01  0.02  0.02  0.12  0.05  0.01  0.01
 12    0.03  0.12  0.16  0.14  0.14  0.08  0.12  0.06  0.03  0.02  0.01  0.02  0.05  0.01  0.01
 13    0.02  0.08  0.14  0.19  0.13  0.11  0.09  0.07  0.04  0.02  0.01  0.02  0.02  0.03  0.02
 14    0.05  0.13  0.16  0.21  0.11  0.09  0.04  0.02  0.02  0.04  0.03  0.02  0.03  0.02  0.03
 15    0.03  0.12  0.21  0.18  0.16  0.10  0.06  0.03  0.04  0.01  0.01  0.01  0.01  0.02  0.02

Table 1: Length probabilities computed for the English/French JOC corpus (see Section 4.2). l and m denote the lengths of the English and French sequences, respectively. The elements are pL(m | l).

Unlike the computation of W in the EM process, this time pL is not included. That is because the pT estimates already incorporate the length probabilities (since the lengths of s and t are intrinsic parts of their identities, unlike the positional offsets, which depend on the local constellation). In cases of identical translation and offset probabilities, pL can be used to break the tie.

This alignment technique was tested against a few other methods, which generate a more coherent output, i.e. an alignment with no contradicting connections. However, for our evaluation data (see Subsection 4.4.1), this simplest version gave the best outcome.

3.2 Candidate selection

Trying to process all possible sequences, even up to a relatively small length, would result in a very high complexity of time and space. The word_align algorithm, if not imposing any restrictions on candidate words and connections, would need time and space of O(N · M), where N and M denote the number of tokens in the source and target texts, respectively. Applying seq_align in the same conditions would require O(Lmax² · N · M). This theoretical situation motivates a serious effort to reduce the number of candidate sequences. In addition, the existence of certain kinds of candidates may introduce some noise into the results. These candidates, too, should be eliminated.

The model itself already includes two important filters. The first is the w windowing parameter, which bounds the text area where source candidates for a given target candidate may reside. The frequency ratio limitations also focus the algorithm on more likely connections.
As function words tend to be translated inconsistently, a common practice in the alignment field is to dismiss them in advance using a pre-defined language-specific stop-word list. Working with multi-word sequences, we used such lists to eliminate any sequence which included only stop words. We refer to these units as stop-word sequences. Another kind of sequence unlikely to form an atomic unit is one containing punctuation marks. These, too, were ignored.

The other two filters which we applied are related to the frequencies of the sequences within the texts. Sequences which appear only once within the text are problematic from two points of view: (a) they are very likely to be accidental, and (b) their alignment is unreliable because it cannot be verified using any other source. Hence, we excluded them from the candidate list of each text, which reduced the sizes of these lists by more than 90%, due to data sparseness.

The last filtering process concerns sequences which appear in a unique context, that is, sequences whose preceding or succeeding tokens are the same for all of their occurrences. If the longer sequences which include these prefix and/or suffix tokens are valid sequences, it is very difficult to determine which of the two sequences should be aligned with a candidate from the other text. For example, if the sequence school always appears within the context of high school, it is hard to know which of them is aligned with lycée. Though it is not an absolute truth, in most cases where a sequence appears only within a unique longer sequence, it is because the latter sequence is a meaningful compound. For that reason, we decided to dismiss all sub-sequences with that characteristic. This was done by comparing the frequency f of each sequence with those of all its sub-sequences and eliminating all sub-sequences whose frequency was equal to f. As the data in Section 4.2 show, this filter is a very powerful instrument for reducing the number of valid candidates.
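A sketch of the overall candidate-selection pipeline described in this section (ours; the stop-word handling, the per-token punctuation test and the quadratic unique-context check are illustrative simplifications):

    from collections import Counter

    def select_candidates(tokens, stop_words, l_max=15):
        # All contiguous sequences of lengths 1..l_max: their number is
        # linear in the text's length, as argued in Section 2.3.
        seqs = [tuple(tokens[i:i + l])
                for i in range(len(tokens))
                for l in range(1, min(l_max, len(tokens) - i) + 1)]
        freq = Counter(seqs)

        def contains(longer, shorter):
            return any(longer[k:k + len(shorter)] == shorter
                       for k in range(len(longer) - len(shorter) + 1))

        def valid(seq):
            if freq[seq] < 2:                          # hapax sequences
                return False
            if all(w in stop_words for w in seq):      # stop-word sequences
                return False
            if any(not w.isalnum() for w in seq):      # punctuation marks
                return False
            # unique-context filter: drop a sub-sequence whose frequency
            # equals that of a longer sequence containing it.
            return not any(f == freq[seq] and len(sup) > len(seq)
                           and contains(sup, seq)
                           for sup, f in freq.items())

        return {seq for seq in freq if valid(seq)}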
3.3 Improvement of the dictionary

3.3.1 Noise-cleaning algorithm

Consider the raw dictionary entry presented in Figure 3. It can easily be recognised that each target sequence either is a sub-sequence of another or contains another. Obviously, two sequences where one is contained within the other are rarely correct translations at the same time. A further observation reveals that the correct translation is not assigned a higher probability than those of two wrong suggestions. In some other examples, the correct translation was given even the lowest probability relative to all other options.

Source: local and regional authority (Freqs = 6)

Target                          Freqt   pT(t|s)
autorité local                  60      0.142533
autorité local et               10      0.142533
autorité local et régional      5       0.142533
local et                        48      0.122089
local et régional               10      0.122089
et régional                     33      0.120195
le autorité local et régional   4       0.104451
le autorité local et            8       0.103579

Figure 3: A sample raw dictionary entry sorted by translation probability

A study of a few dozen examples has led to the conclusion that there are three factors which are relevant to the selection of the correct translation. A balanced composition of these factors has resulted in the heuristic algorithm presented below.

Suppose t1 and t2 are two target sequences that satisfy t1 ⊂ t2, i.e. t1 is a sub-sequence of t2. Since unique-context sequences are eliminated from the candidate lists (as explained above in Section 3.2), t1 cannot have the same frequency as t2. Rather, t1 must be more frequent than t2, because any occurrence of t2 contains an occurrence of t1, but not vice versa. Consequently, a greater probability for t1 is not an absolute indication of its preference. On the contrary, a higher probability for t2 is certainly a good reason to favour it.

As mentioned above (in Section 2.3), the ideal case of bilingual alignment is where each unit has exactly one parallel in the other half of the text. Though such a situation is a utopia, some units do satisfy this condition or come very close to doing so. Empirical experience indicates that even in less perfect instances, a relatively small difference between the frequencies of the source and target units is a good clue to the correctness of a given translation. Nevertheless, this heuristic has proven to be a successful selection criterion only where the smaller difference belongs to the longer sequence (t2). In other cases, the probabilities have seemed to play a very instrumental role in making the correct decision.

The resulting heuristic is as follows: Given the dictionary entry of an arbitrary source sequence s, the frequency difference of each translation sequence t is defined as the difference between the frequencies of t and s. For each pair of translation sequences where one is a sub-sequence of the other, if the longer sequence has a smaller frequency difference, then it should be selected regardless of its probability. If the differences are equal, then the shorter sequence must be supported by a better probability in order to be favoured. When the shorter sequence has the smaller difference, then it should be chosen unless the longer sequence provides probabilistic evidence of its superiority.

The cleaning algorithm works in two phases. In the first phase, every pair of target sequences (which satisfy the containment condition) is tried. One of the candidates is invalidated using the heuristic rules, while the identity of the other sequence is added to its list of "better translations". In the second phase, the probabilities of the invalidated sequences are equally distributed among those sequences which "overcame" them, provided that those sequences have remained valid. A pseudo-code of the entire process is given below.
Definitions

s: the source sequence
|t|: the number of translations
ti: the ith target sequence in the translation list
pi = pT(ti | s)
f(x): the frequency of the sequence x
di = |f(ti) − f(s)|

Algorithm

CHOOSEANDINVALIDATE(i, j):
    invalid_j = true                     // invalidating j
    better_j = better_j ∪ {i}            // adding i to the list of "better translations" of j

MAIN:
    // First phase: invalidation
    for i = 1 to |t|:
        for j = (i + 1) to |t|:
            if (invalid_i && invalid_j) continue
            if (t_i ⊂ t_j)      (short, long) = (i, j)
            elsif (t_j ⊂ t_i)   (short, long) = (j, i)
            else continue                // no containment
            if (d_long < d_short)
                CHOOSEANDINVALIDATE(long, short)
            elsif (d_long == d_short)
                if (p_short > p_long) CHOOSEANDINVALIDATE(short, long)
                else                  CHOOSEANDINVALIDATE(long, short)
            else                         // d_long > d_short
                if (p_long > p_short) CHOOSEANDINVALIDATE(long, short)
                else                  CHOOSEANDINVALIDATE(short, long)

    // Second phase: probability distribution
    for i = 1 to |t|:
        unless (invalid_i) continue
        count = number of valid indexes in better_i
        if (count == 0) continue
        share = p_i / count
        foreach j in (valid indexes in better_i):
            p_j += share

Figure 4 shows the clean dictionary entry achieved after the application of the algorithm to the raw entry of Figure 3:

Source: local and regional authority (Freqs = 6)

Target                       Freqt   pT(t|s)
autorité local et régional   5       0.896423

Figure 4: The clean dictionary entry

3.3.2 Prefix and suffix stop-words

As a consequence of the statistical model, the translation probabilities are computed for source-language units, while the estimation of local connection probabilities is done for target-language units (see Sections 2.4 and 3.1). Similarly to the latter operation, the detailed alignment is also done by selecting the best source candidate for each target unit. Therefore, the dictionary used in that process should consist of target-language entries instead of source-language entries. This resource is obtained by simply sorting the output of the EM process by target sequence. The resulting dictionary might be referred to as inverted.

Yet, consider Figure 5, which presents an example of a raw entry of such an inverted dictionary. A quick look at the source entries reveals an interesting phenomenon: many suggested translations include prefix and/or suffix sub-sequences of function words or, as we have already labelled them, stop words. For example, aid for the, this aid, be grant to etc. The occurrence of necessary prefix and suffix stop-word sequences is certainly possible. For instance, the expression dans le cadre de may be translated into in the framework of. However, when the same basic sequence is surrounded by different prefixes or suffixes, there is good reason to believe that all of them are redundant. A weaker but still reliable indication of redundancy is where the basic sequence occurs in the translation list of a dictionary entry only once, but is a sub-sequence of another translation. To improve the quality of the alignment, we applied the simple method presented in Figure 6 to eliminate this kind of noise. Figure 7 presents the improved dictionary entry for the sequence aide as generated by this algorithm.
Target: aide

Source                 pT(t | s)
aid for                1
aid for the            1
aid from               1
aid from the           1
aid have               1
aid in                 1
aid in the             1
aid measure            1
assistance             1
assistance for         1
assistance in          1
assistance to          1
assistance to the      1
be grant to            1
community assistance   1
donor                  1
for aid                1
grant to               1
grant to the           1
subsidy for            1
such aid               1
support from           1
this aid               1
aid                    0.863936
financial aid          0.709433
aid to                 0.487991
financial assistance   0.425051
christine oddy         0.290366
of humanitarian        0.226983
the victim of          0.182403

Figure 5: A sample of a raw entry of an inverted dictionary

Definition: The seed of each translation is the longest sub-sequence beginning and ending in a content word.

Operation:

1. If two or more translations share the same seed, replace all of them by a single source entry comprised of the seed itself and the highest probability of all the original sequences.

2. If a seed occurs in the translation list of an entry only once, remove the affix stop words, provided that this seed is a sub-sequence of another translation.

Figure 6: Stop-word affix elimination algorithm

Target: aide

Source                 pT(t | s)
aid                    1
aid measure            1
assistance             1
community assistance   1
donor                  1
grant                  1
subsidy for            1
support from           1
financial aid          0.709433
financial assistance   0.425051
christine oddy         0.290366
of humanitarian        0.226983
the victim of          0.182403

Figure 7: The improved entry

It should be noted that there are certainly many cases where the exact translation is concatenated with redundant content words. These cases, though, cannot easily be distinguished from cases where the longer translation is the correct one. Applying the noise-cleaning algorithm of Subsection 3.3.1 to the inverted dictionary is not trivial due to the lack of a direct relation between the probabilities.

When we applied the above method to the dictionary used for the evaluation of detailed alignment (see Subsection 4.4.1), the overall F-measure rose by 13 percentage points (from 41% to 54%), the recall by 8 points and the precision by 15 points.
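For concreteness, a sketch of the seed extraction underlying Figure 6 (ours; the stop-word lists shown are illustrative, not the ones actually used in this research):

    def seed(translation, stop_words):
        # The seed is the longest sub-sequence that begins and ends in a
        # content (non-stop) word.
        words = translation.split()
        content = [k for k, w in enumerate(words) if w not in stop_words]
        if not content:
            return None                  # translation has no content word
        return " ".join(words[content[0]:content[-1] + 1])

    # Toy usage with tiny illustrative stop-word lists:
    stops_en = {"for", "the", "from", "to", "in", "be", "this", "such", "have"}
    assert seed("aid for the", stops_en) == "aid"
    assert seed("be grant to", stops_en) == "grant"
    assert seed("dans le cadre de", {"dans", "le", "de"}) == "cadre"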
The quality of the detailed alignment is obviously also substantial for the efficient work of the translator. This quality is usually measured in terms of precision and recall. Precision is the extent to which the translations suggested by the alignment's connections are correct (ignoring unaligned sequences). Recall measures how many sequences (out of all valid sequences) in one half of the text were aligned with their correct counterparts in the other half (i.e. the alignment's coverage rate).

The way of computing the precision and recall of an alignment depends on the desired application. If only full and exact alignments are useful, then a strict definition of these measures is required, such that scores are given only if the suggested translation consists of all reference words and no redundancies. For the human translator, however, a more relaxed definition is acceptable, one which gives partial scores to partial successes. More specifically, for a given word sequence in one text, the precision of a translation can be defined as the ratio between the number of correct words suggested by the alignment and the total number of suggested words. Similarly, the recall is the ratio between the number of correct words suggested by the alignment and the number of correct words stated in the reference connection list. Equation (10) states these definitions in formulae. The aggregate precision and recall are therefore a simple average of these values over all reference connections.

    precision = (# of correct words) / (# of all suggested words)
    recall    = (# of correct words) / (# of reference words)            (10)

A widely accepted measure for the overall quality of an alignment is the F-measure, which combines the precision and recall as follows:

    F = (2 × precision × recall) / (precision + recall)                  (11)

The evaluation presented in this dissertation was performed using the data of the ARCADE project, as detailed in Section 4.2 and Subsection 4.4.1. The alignment systems which participated in that project were judged according to the above definitions of precision, recall and F. These definitions were adopted for seq_align's evaluation in order to enable comparing the results with those achieved during the ARCADE project. This comparison was done in addition to the obviously required assessment of seq_align versus word_align. Both evaluations are presented and discussed in Section 4.4.

4.2 The test corpus

The test corpus selected for the evaluation of the seq_align algorithm comprises the English and French versions of the European Union's JOC corpus, a text pair which had previously been used in the word track of the ARCADE evaluation project (Véronis & Langlais, 2000). The ARCADE project was the first framework in which word-level text alignment systems were comparatively evaluated. This evaluation was intended to create a world-wide benchmark for the state of the art. Using the ARCADE data for the evaluation of seq_align enabled comparing the results to this global standard.

The JOC corpus is a collection of written questions on various topics, directed to the Commission of the European Parliament, each of which is followed by the corresponding answer, given by the relevant official. The large variety of related topics results in an enormous quantity of specialised terms from distinct domains. The texts are supplied aligned at the paragraph level, such that each pair of corresponding questions or answers in the two languages is marked by the same numeric identifier.
Nevertheless, as the translation is rather precise in terms of sentence contents and order, a linear alignment within the paragraph boundaries gives a sufficiently reliable rough alignment. It should be noted that the translation of the text is indirect, in the sense that at least parts of the two texts were translated from another text written in a third language.

The English raw text has about 1,050,000 words, whereas the respective French text consists of circa 1,162,000 words. The tokenised-lemmatised versions of these texts contain around 1,171,000 and 1,423,000 tokens, respectively.8

As stated above in Section 3.2, besides sequences occurring only once and those containing punctuation marks, two additional kinds of candidate sequences are filtered out before the dictionary is induced: (a) unique-context sub-sequences, and (b) stop-word sequences. Table 2 presents the frequency distribution of the English candidate sequences of each length separately, as well as the totals for each length and for each frequency range; the aggregate total number of sequences is displayed at the bottom-right corner. Table 3 gives the same kind of data as observed after the first filtering process. Table 4 reports the counts performed on the final candidate list, as attained after applying both cleaning processes. The parallel information concerning the French text is given in Table 5, Table 6 and Table 7.9

The data presented in the tables show that the contextual filtering drops the number of candidates to about one third, a quantitative effect which the elimination of stop-word sequences does not have. Nevertheless, since these sequences appear with rather high frequencies, their elimination from the candidate list avoids the creation of a large number of noisy connections during the dictionary induction process.

8 Both texts were tokenised and lemmatised using the Decision TreeTagger, kindly supplied by the IMS institute, University of Stuttgart, Germany (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html).

9 The maximal sequence length was set to the arbitrary value of 15, which seemed a rather reasonable limitation on the length of phrasal units. Empirical experience indicates that lowering this threshold does not significantly change the quality of the results.
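For concreteness, counts of the kind reported in Tables 2–7 below can be reproduced along the following lines. This is only an illustrative Python sketch: tokenisation-lemmatisation and the candidate filtering described above are presupposed, and all names are assumptions of the illustration.

    # Tabulate distinct word sequences (up to length 15) by sequence
    # length and by decimal frequency range, as in Tables 2-7.
    from collections import Counter

    L_MAX = 15
    RANGES = [(2, 9), (10, 99), (100, 999), (1000, 9999),
              (10000, 99999), (100000, 999999)]

    def sequence_frequencies(tokens):
        """Frequency of every contiguous sequence of 1..L_MAX tokens."""
        freq = Counter()
        for n in range(1, L_MAX + 1):
            for i in range(len(tokens) - n + 1):
                freq[tuple(tokens[i:i + n])] += 1
        return freq

    def distribution(freq):
        """table[length - 1][range_index] = number of distinct sequences.
        Sequences occurring only once fall into no range, matching their
        exclusion from the candidate list."""
        table = [[0] * len(RANGES) for _ in range(L_MAX)]
        for seq, f in freq.items():
            for r, (lo, hi) in enumerate(RANGES):
                if lo <= f <= hi:
                    table[len(seq) - 1][r] += 1
                    break
        return table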
lng. | 2–9 | 10–99 | 100–999 | 1,000–9,999 | 10,000–99,999 | 100,000–999,999 | Total
1 | 8169 | 3568 | 1059 | 106 | 11 | 0 | 12913
2 | 59409 | 11455 | 954 | 35 | 2 | 0 | 71855
3 | 82776 | 7241 | 334 | 11 | 0 | 0 | 90362
4 | 63267 | 3053 | 138 | 3 | 0 | 0 | 66461
5 | 41879 | 1370 | 95 | 1 | 0 | 0 | 43345
6 | 28086 | 739 | 79 | 0 | 0 | 0 | 28904
7 | 20273 | 462 | 60 | 0 | 0 | 0 | 20795
8 | 15525 | 294 | 46 | 0 | 0 | 0 | 15865
9 | 12350 | 208 | 32 | 0 | 0 | 0 | 12590
10 | 10127 | 155 | 18 | 0 | 0 | 0 | 10300
11 | 8450 | 107 | 4 | 0 | 0 | 0 | 8561
12 | 7143 | 74 | 0 | 0 | 0 | 0 | 7217
13 | 6093 | 51 | 0 | 0 | 0 | 0 | 6144
14 | 5200 | 36 | 0 | 0 | 0 | 0 | 5236
15 | 4451 | 24 | 0 | 0 | 0 | 0 | 4475
Total | 373198 | 28837 | 2819 | 156 | 13 | 0 | 405023

Table 2: Frequency distribution of the English candidate sequences before filtering. Each row details the distribution of sequences of the indicated length.

lng. | 2–9 | 10–99 | 100–999 | 1,000–9,999 | 10,000–99,999 | 100,000–999,999 | Total
1 | 5714 | 3456 | 1050 | 106 | 11 | 0 | 10337
2 | 39433 | 10777 | 920 | 34 | 2 | 0 | 51166
3 | 44764 | 6261 | 279 | 11 | 0 | 0 | 51315
4 | 26836 | 2316 | 79 | 2 | 0 | 0 | 29233
5 | 12839 | 816 | 24 | 1 | 0 | 0 | 13680
6 | 5627 | 315 | 15 | 0 | 0 | 0 | 5957
7 | 2801 | 151 | 1 | 0 | 0 | 0 | 2953
8 | 1516 | 75 | 10 | 0 | 0 | 0 | 1601
9 | 862 | 48 | 12 | 0 | 0 | 0 | 922
10 | 535 | 30 | 14 | 0 | 0 | 0 | 579
11 | 362 | 23 | 4 | 0 | 0 | 0 | 389
12 | 232 | 12 | 0 | 0 | 0 | 0 | 244
13 | 210 | 8 | 0 | 0 | 0 | 0 | 218
14 | 162 | 8 | 0 | 0 | 0 | 0 | 170
15 | 127 | 3 | 0 | 0 | 0 | 0 | 130
Total | 142020 | 24299 | 2408 | 154 | 13 | 0 | 168894

Table 3: Frequency distribution of the English candidate sequences after the elimination of unique-context sub-sequences. Each row details the distribution of sequences of the indicated length.

lng. | 2–9 | 10–99 | 100–999 | 1,000–9,999 | 10,000–99,999 | 100,000–999,999 | Total
1 | 5688 | 3395 | 973 | 69 | 1 | 0 | 10126
2 | 38255 | 10061 | 740 | 14 | 1 | 0 | 49071
3 | 43194 | 5844 | 257 | 11 | 0 | 0 | 49306
4 | 26241 | 2267 | 79 | 2 | 0 | 0 | 28589
5 | 12764 | 813 | 24 | 1 | 0 | 0 | 13602
6 | 5621 | 315 | 15 | 0 | 0 | 0 | 5951
7 | 2799 | 151 | 1 | 0 | 0 | 0 | 2951
8 | 1516 | 75 | 10 | 0 | 0 | 0 | 1601
9 | 862 | 48 | 12 | 0 | 0 | 0 | 922
10 | 535 | 30 | 14 | 0 | 0 | 0 | 579
11 | 362 | 23 | 4 | 0 | 0 | 0 | 389
12 | 232 | 12 | 0 | 0 | 0 | 0 | 244
13 | 210 | 8 | 0 | 0 | 0 | 0 | 218
14 | 162 | 8 | 0 | 0 | 0 | 0 | 170
15 | 127 | 3 | 0 | 0 | 0 | 0 | 130
Total | 138568 | 23053 | 2129 | 97 | 2 | 0 | 163849

Table 4: Frequency distribution of the English candidate sequences after both unique-context sub-sequences and stop-word sequences have been eliminated. Each row details the distribution of sequences of the indicated length.

lng. | 2–9 | 10–99 | 100–999 | 1,000–9,999 | 10,000–99,999 | 100,000–999,999 | Total
1 | 8198 | 3661 | 1100 | 106 | 10 | 2 | 13077
2 | 50384 | 11483 | 1323 | 63 | 3 | 0 | 63256
3 | 86274 | 11380 | 823 | 23 | 0 | 0 | 98500
4 | 89265 | 6677 | 308 | 7 | 0 | 0 | 96257
5 | 69989 | 3377 | 139 | 3 | 0 | 0 | 73508
6 | 50950 | 1739 | 93 | 1 | 0 | 0 | 52783
7 | 36737 | 992 | 80 | 0 | 0 | 0 | 37809
8 | 27312 | 597 | 61 | 0 | 0 | 0 | 27970
9 | 21257 | 423 | 47 | 0 | 0 | 0 | 21727
10 | 17040 | 324 | 33 | 0 | 0 | 0 | 17397
11 | 13971 | 252 | 19 | 0 | 0 | 0 | 14242
12 | 11640 | 198 | 5 | 0 | 0 | 0 | 11843
13 | 9773 | 168 | 1 | 0 | 0 | 0 | 9942
14 | 8263 | 142 | 0 | 0 | 0 | 0 | 8405
15 | 7040 | 123 | 0 | 0 | 0 | 0 | 7163
Total | 508093 | 41536 | 4032 | 203 | 13 | 2 | 553879

Table 5: Frequency distribution of the French candidate sequences before filtering. Each row details the distribution of sequences of the indicated length.

lng. | 2–9 | 10–99 | 100–999 | 1,000–9,999 | 10,000–99,999 | 100,000–999,999 | Total
1 | 5193 | 3509 | 1091 | 106 | 10 | 2 | 9911
2 | 28362 | 10452 | 1269 | 63 | 3 | 0 | 40149
3 | 42988 | 9822 | 759 | 23 | 0 | 0 | 53592
4 | 36329 | 5106 | 241 | 6 | 0 | 0 | 41682
5 | 22637 | 2235 | 68 | 2 | 0 | 0 | 24942
6 | 12625 | 963 | 17 | 1 | 0 | 0 | 13606
7 | 6867 | 438 | 11 | 0 | 0 | 0 | 7316
8 | 3530 | 170 | 5 | 0 | 0 | 0 | 3705
9 | 1902 | 86 | 8 | 0 | 0 | 0 | 1996
10 | 1146 | 53 | 13 | 0 | 0 | 0 | 1212
11 | 750 | 41 | 13 | 0 | 0 | 0 | 804
12 | 542 | 16 | 4 | 0 | 0 | 0 | 562
13 | 397 | 21 | 1 | 0 | 0 | 0 | 419
14 | 284 | 12 | 0 | 0 | 0 | 0 | 296
15 | 213 | 16 | 0 | 0 | 0 | 0 | 229
Total | 163765 | 32940 | 3500 | 201 | 13 | 2 | 200421

Table 6: Frequency distribution of the French candidate sequences after the elimination of unique-context sub-sequences. Each row details the distribution of sequences of the indicated length.

lng. | 2–9 | 10–99 | 100–999 | 1,000–9,999 | 10,000–99,999 | 100,000–999,999 | Total
1 | 5183 | 3476 | 1038 | 78 | 1 | 0 | 9776
2 | 27688 | 9902 | 1087 | 39 | 1 | 0 | 38717
3 | 41707 | 9302 | 690 | 21 | 0 | 0 | 51720
4 | 35612 | 4965 | 234 | 6 | 0 | 0 | 40817
5 | 22405 | 2204 | 68 | 2 | 0 | 0 | 24679
6 | 12560 | 958 | 17 | 1 | 0 | 0 | 13536
7 | 6855 | 436 | 11 | 0 | 0 | 0 | 7302
8 | 3528 | 170 | 5 | 0 | 0 | 0 | 3703
9 | 1901 | 86 | 8 | 0 | 0 | 0 | 1995
10 | 1146 | 53 | 13 | 0 | 0 | 0 | 1212
11 | 750 | 41 | 13 | 0 | 0 | 0 | 804
12 | 542 | 16 | 4 | 0 | 0 | 0 | 562
13 | 397 | 21 | 1 | 0 | 0 | 0 | 419
14 | 284 | 12 | 0 | 0 | 0 | 0 | 296
15 | 213 | 16 | 0 | 0 | 0 | 0 | 229
Total | 160771 | 31658 | 3189 | 147 | 2 | 0 | 195767

Table 7: Frequency distribution of the French candidate sequences after both unique-context sub-sequences and stop-word sequences have been eliminated. Each row details the distribution of sequences of the indicated length.
4.3 Dictionary evaluation

The ideal set of parameters used in complex statistical processes, such as seq_align's dictionary induction process, depends highly on the processed text pair and is thus almost impossible to extrapolate. Therefore, this setting is always determined based on previous empirical experience, leaving a certain "safety range" so as to take into account unprecedented or extraordinary deviations. The evaluated dictionary was produced using the following parameter setting (see Sections 2.4 and 3.1, where the EM process is discussed in detail):

L_max (maximal sequence length) = 15 (see footnote 9 on p. 38).

w (the width of the allowed window for source-text candidates) = 15: determined according to the relatively high reliability of the initial rough alignment (see Section 4.2).

Iteration filtering threshold = 0.01: only very unreliable connections were eliminated.

Number of iterations = 10: shown to be sufficient for the convergence of the model.

Final filtering threshold = 0.1: relatively reliable translations.

Maximal frequency ratio = 100: this is a very high value. It was chosen because the sample of the ARCADE project had concentrated on multi-contextual words having many different and sometimes rare translations.

This parameter setting resulted in a certain level of noise. The evaluation presented below was done after the dictionary had been refined by the noise-cleaning algorithm described in Subsection 3.3.1.

As explained above (in Section 4.1), two aspects of the dictionary's quality were evaluated: (a) coverage rate, and (b) quality of existing entries. Both assessments were based on a random sampling of sequences. As the coverage evaluation is intended to estimate the chance of finding a dictionary entry for a given term, the reference list for this task was picked from the final list of English candidates (induced from the text as described in Section 3.2). In contrast, the reference list for the qualitative entry evaluation was extracted from the list of dictionary entries.

The candidate lists for both evaluations were restricted to sequences likely to be searched for by a human translator. More specifically, sequences beginning or ending in stop words or numbers were excluded, because English terms rarely have this structure. In addition, sequences occurring within the text fewer than 5 times were filtered out, since they are less likely to be specialised terms. The random samples taken from these partial lists were finally cleaned of non-term sequences by a human judge.

The coverage sample consisted of 85 terms, of which 78 (91.8%) existed as dictionary entries. Table 8 presents the entire entry-quality sample, containing 83 English terms (selected by a human judge independently of the first sample) along with their suggested translations. Table 9 summarises the results of the evaluation of this sample. The First, Second and Third categories refer to terms whose full and exact translations were found at the first, second and third places within the suggested translation list, respectively. The Split category relates to those terms whose translation is given completely, but is broken into more than one target sequence in the dictionary. For instance, the English term research centre should be translated into French as centre de recherche. However, the dictionary offers the separate sequences le centre and de recherche.
English source term | French translations
joint answer to write question (149) | réponse commune à le question (0.8217, 145)
former yugoslavia (98) | le yougoslavie (1, 100)
parliament 's secretariat (66) | de le parlement européen (0.442128, 426), et à le secrétariat (0.143291, 87)
euratom treaty (57) | le traité euratom (0.766732, 57), le article (0.110678, 1433)
invitation to tender (46) | appel de offre (1, 92)
infringement procedure (43) | procédure de (0.546413, 390), infraction (0.453586, 249)
eastern european country (40) | le pays de (0.966365, 409)
social rights (30) | le droit (0.913762, 1565)
prime minister (25) | premier ministre (0.624228, 25), le premier (0.375772, 321)
objective 1 region (24) | de le objectif № 1 (1, 34)
implementation of directive (23) | application de le (0.797454, 603), de le directive (0.202547, 973)
community support frameworks (22) | cadre communautaire de appui (0.979087, 102)
respect of human right (22) | de le droit de (0.3009525, 557), le droit de le (0.2594965, 671), droit de le homme (0.212904, 555)
research centre (21) | de recherche (0.58948, 336), le centre (0.364115, 223)
high education diploma (20) | le diplôme de (0.548492, 38), de enseignement supérieur (0.42465, 37)
infringement of community law (19) | le droit communautaire (0.644084, 251), infraction (0.350831, 249)
veterinary medicine (19) | le médicament vétérinaire (1, 20)
geneva convention (18) | convention de genève (1.000001, 16)
share the view (18) | partager (1, 148)
nuclear installation (17) | le installation nucléaire (0.617818, 32), de le installation (0.32276, 96)
application of directive (16) | application de le directive (0.72932, 113), le application (0.27068, 508)
fishery resource (16) | le ressource (1, 236)
community participation (15) | participation de le communauté (1, 18)
selection criterion (15) | de sélection (0.601411, 33), critère de (0.278096, 109), le critère (0.120494, 216)
flora and fauna (14) | le faune (0.534696, 71), le flore (0.465304, 54)
small and medium size enterprise (14) | petit et moyen entreprise (0.926607, 72)
commission report (13) | rapport (1, 1108)
northern ireland (13) | irlande de le nord (1, 12)
secretariat of the european parliament (13) | secrétariat général de le parlement (0.5189315, 122), le secrétariat général de le (0.1905355, 139)
harmful substance (12) | substance (1, 259)
cotton producer (11) | le producteur de coton (0.7178635, 7), de le producteur (0.1715235, 79)
financial regulation (11) | le règlement financier (0.86068, 11)
regional and local body (11) | le collectivité régional et local (0.853599, 17)
toy safety (11) | sécurité de le jouet (0.815052, 22), le sécurité (0.184948, 400)
german law (10) | de le loi (0.4649225, 118), le loi allemand (0.4608975, 9)
legal instrument (10) | juridique (1, 281)
technical progress (10) | le progrès technique (0.913113, 10)
copyright protection (9) | protection de le (0.27091, 693), de le droit de (0.25928, 557), le protection (0.146291, 612), de auteur (0.118302, 48)
emilia romagna (9) | romagne (0.80016, 9), le zone (0.19984, 472)
forest area (9) | forêt (1, 283)
selection board (9) | de le jury (0.913342, 14)
air traffic (8) | trafic aérien (1.000001, 7)
civil liability (8) | responsabilité civil (0.860281, 11), le responsabilité (0.139719, 158)
heat treatment (8) | traitement thermique (0.355835, 10), le bois (0.206769, 97), bois de (0.186439, 23), minute (0.135507, 9)
positive action programme (8) | en faveur de (0.3736175, 526), programme de action positif (0.213217, 6), faveur de le (0.1868085, 433)
solitary confinement (8) | isolement (1, 15)
trade union right (8) | syndical de le (0.426879, 6), droit syndical (0.345815, 10), agent de (0.18851, 28)
academic staff (7) | recrutement de (0.3120455, 29), de recrutement (0.2893015, 22), enseignant (0.154137, 40)
car tax (7) | taxe automobile (0.893604, 2), local № (0.106396, 2)
community import (7) | le importation communautaire (1, 14)
non technical summary (7) | résumé non technique (0.715219, 7), le article 5.2. (0.173798, 2)
radio station (7) | radio commercial (0.972542, 4)
school nurse (7) | infirmier scolaire (0.598184, 6), de infirmier (0.12156, 9)
stainless steel (7) | en acier (0.452073, 12), tube (0.263171, 11)
available information (6) | disponible (0.569471, 241), tronçon (0.430529, 13)
committee of inquiry (6) | commission de enquête (0.52114, 8), enquête de le (0.322032, 18), de le parlement (0.102073, 564)
european bank (6) | européen pour (0.26379, 130), banque européen (0.259424, 62), le reconstruction (0.254201, 27), le développement (0.222585, 584)
future cooperation (6) | coopération futur (0.6936005, 4), leur futur (0.2644605, 2)
hygiene directive (6) | sur le hygiène (0.524822, 3), de directive (0.236108, 225), proposition de (0.177081, 520)
insurance organization (6) | grec de (0.411557, 97), oga (0.180682, 3), organisme (0.176888, 359)
legislative measure (6) | le mesure législatif (0.967564, 13)
medical assistance (6) | assistance médical (0.7736, 7), médical en (0.201264, 5)
national congress (6) | le congrès national (1, 6)
purification system (6) | domestique de (0.414207, 6), de épuration (0.363314, 39), appareil (0.169361, 48)
social situation (6) | situation économique et (0.643715, 5), de patras (0.341051, 46)
aid budget (5) | de aide (0.37485, 356), le véhicule à moteur et de (0.125203, 4), sur le véhicule à moteur (0.124907, 4), et de le taxe (0.123036, 2)
competition for recruitment (5) | condition de attribution (0.263102, 3), de monsieur virginio bettini (0.263102, 9), dernier concours (0.178227, 3), attribution et (0.131551, 4), périphérique (0.131551, 39)
dental practitioner (5) | art dentaire (0.445513, 13), praticien (0.367018, 7), le art (0.180626, 16)
ec information (5) | juridique de (0.257913, 59), protection juridique (0.252701, 12), consommateur et (0.178553, 25), et worldcom (0.169339, 3), information sur (0.141494, 173)
energy issue (5) | à le énergie (0.527547, 17), de le question (0.176263, 215)
european network of high speed train (5) | réseau européen de train à grand vitesse (0.330116, 9), publique (0.131728, 401), rendre (0.131728, 337)
freedom of movement of person (5) | libre circulation de le personne (0.87184, 31)
health authority (5) | le autorité sanitaire (0.818522, 4), le santé (0.181478, 419)
illegal discharge (5) | illicite en (0.650023, 4), rejet (0.349976, 96)
language capability (5) | le compétence linguistique (0.871604, 4)
local tax (5) | taxe (0.599337, 314), frapper le produit (0.400663, 4)
meeting of expert (5) | le csce et (0.264953, 7), humain de le (0.253833, 5), de le expert (0.143893, 44), et le rapport (0.143893, 14), le réunion (0.100291, 169)
pilot plant (5) | pilote (0.576723, 42), que à le moins (0.377826, 6)
private individual (5) | un particulier (1, 6)
radiation protection research action (5) | recherche dans le domaine de le (0.692034, 19)
rehabilitation centre (5) | de rééducation (0.438528, 5), construction de (0.292352, 234), un centre (0.142234, 55)
support activity (5) | soutien (1, 350)
union of agricultural cooperative (5) | le union de coopérative agricole (0.999999, 8)
Table 8: Sample of entry quality. The integers in parentheses indicate the sequences' frequencies within the text. The real numbers in the French translations column represent the probabilities of the translations as computed by the seq_align algorithm and re-estimated by the noise-cleaning algorithm.

Category | Count | %
First | 42 | 50.6
Second | 5 | 6.0
Third | 1 | 1.2
Split | 17 | 20.5
Partial | 17 | 20.5
Erroneous | 1 | 1.2

Table 9: Entry quality evaluation

Apparently, this phenomenon could be perceived as a weakness of the seq_align algorithm, because it does not indicate the contiguous translation. Nevertheless, looking within the text reveals that most, if not all, of these target terms are actually split there. For example, the term national research centre is translated into centre national de recherche, whereas community research centre yields centre communautaire de recherche. This split display of the translation not only tells the translator that the translation might be broken, but also points to the exact location where this break normally happens.

The Partial category refers to those terms which were not given the entire translation (either successive or split), but were assigned principal parts of the expected target expressions. Though these entries do not supply the translator with all the needed information, the bilingual concordance (based on the detailed alignment) can help in finding the missing pieces.

In fact, the first four categories deal with entries which give full and precise information. In other words, approximately 79% of the entries provide the translator with all of the necessary knowledge. Adding the 20.5% of partial hits, which are valuable as well, it can be concluded that almost 100% of the dictionary's entries can be useful for a human translator.

4.4 Alignment evaluation

This section describes two different evaluations. The first relates to the full reference list of the ARCADE project, whereas the second focuses on multi-word terminology from the viewpoint of a human translator. The term-wise evaluation was done on the same dictionary used for the evaluation of Section 4.3, which had been cleaned by the noise-cleaning algorithm. The ARCADE results presented here relate to the uncleaned dictionary, improved only by the simple stop-word elimination algorithm (see Section 3.3). Experiments showed that, though the uncleaned dictionary is of much lower quality for a human translator, a certain amount of useful information is lost during the cleaning, which decreases the grade for the ARCADE sample.

As mentioned in Section 2.4, a minimal probability threshold T is pre-defined to avoid noisy connections. Experiments showed that a rather reasonable value for this parameter is 0.005, though slight changes do not make a large difference.

4.4.1 The ARCADE evaluation

The ARCADE project (Véronis & Langlais, 2000) was intended to organise a comparative evaluation of different systems for text alignment. One of the project's tracks was dedicated to word and expression alignment.

The reference list for the word track was prepared as follows: 60 French words were chosen—20 adjectives, 20 nouns and 20 verbs. Each of these words appears within many different contexts, including multi-word idiomatic expressions. About 60 occurrences of each distinct word across the JOC corpus were marked up. Then, two human annotators were asked to mark the entire French expression within which each of the words appeared, and then the English counterpart of that expression.
That way, a set of 3723 French/English word/expression pairs was created. The annotation of each human judge was preserved even when there was a disagreement on either the French or the English unit. In such cases, the evaluation procedure was instructed to take the better grade of the two. It should be noted that some of the marked expressions were split (non-contiguous). Naturally, in some cases no English correspondence existed, so the annotators had to leave the English column blank. The precision and recall for that event were defined as 1 for a blank answer and 0 otherwise.

The task set for the systems participating in the ARCADE competition was equivalent to that given to the human annotators, i.e. to (automatically) identify the French expression which possibly encapsulates the reference word and to indicate its translation within the English text. The publicised results, however, relate only to the latter part, that of finding the correct translations.

Five research groups responded to the ARCADE challenge and sent their system results for the above-mentioned reference list. The best results were achieved by the system of the Xerox Research Centre Europe (XRCE), Grenoble, France, developed by Éric Gaussier and David Hull (for their method, see Sections 2.2 and 2.3). The other participants were the Linköping Word Aligner (Ahrenberg, Andersson & Merkel, 2000) and the systems developed by the RALI group (University of Montreal, Canada), the CEA group (Gif-sur-Yvette, France) and the LILLA group (Nice, France). The alignment methods applied by the latter three systems have never been published.

Table 10 details the achievements of each system in each of the three grammatical categories—adjectives (A), nouns (N) and verbs (V)—as well as the overall averages. The table also gives the same kind of data regarding the original word_align algorithm and its seq_align extension. The four systems which did not win the competition are labelled S1…S4, as done in (Véronis & Langlais, 2000), in order not to embarrass any participant.10

As indicated by the table, both word_align and seq_align achieved the same overall F percentage. Nonetheless, there are some differences between the two algorithms in terms of precision and recall: word_align is more precise, while seq_align has a better coverage rate. In comparison with the other participants, both algorithms take the fourth place in the list (between S2 and S3). The detailed results indicate that both word_align and seq_align have difficulties in aligning cases of inconsistent translation, which are not infrequent in the JOC corpus.

As mentioned above (in Section 2.2), the winning XEROX system uses Hiemstra's bi-directional model. Some authors had previously argued that such a model is more robust, but their claim had not been proven quantitatively. As most reference connections were of the one-to-one type, the large difference between the results of our algorithms and those of the XEROX system must be a consequence of the underlying mathematical models. This leads to a clear conclusion that Hiemstra's model is better than IBM's Model 2.

10 The full data have been kindly provided to us by Mr. Véronis. The accompanying analysis is based on these data.
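Under the conventions just described, the per-connection scoring can be sketched in Python as follows. The word-level multiset comparison realises the relaxed measures of Equations (10) and (11); choosing the better of the two annotator grades by comparing F values is an assumption of this illustration, since the published description does not fix the comparison key.

    from collections import Counter

    def connection_scores(suggested, reference):
        """Relaxed precision and recall of Equation (10) for one connection."""
        correct = sum((Counter(suggested) & Counter(reference)).values())
        precision = correct / len(suggested) if suggested else 0.0
        recall = correct / len(reference) if reference else 0.0
        return precision, recall

    def f_measure(precision, recall):
        """Equation (11)."""
        total = precision + recall
        return 2 * precision * recall / total if total else 0.0

    def score_one_judge(suggested, reference):
        """A reference with no English counterpart scores 1 for a blank
        system answer and 0 otherwise."""
        if not reference:
            return (1.0, 1.0) if not suggested else (0.0, 0.0)
        return connection_scores(suggested, reference)

    def score(suggested, reference1, reference2):
        """Score against both annotators and keep the better grade of the two."""
        grades = [score_one_judge(suggested, reference1),
                  score_one_judge(suggested, reference2)]
        return max(grades, key=lambda pr: f_measure(*pr))

    # e.g. score(["wind", "farm"], ["wind", "farm"], ["farm"]) -> (1.0, 1.0)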
Cat | Entries | System | Prec | Rec | F
A | 1167 | S1 | 0.43 | 0.42 | 0.43
A | | S2 | 0.31 | 0.31 | 0.31
A | | S3 | 0.63 | 0.63 | 0.63
A | | S4 | 0.63 | 0.77 | 0.66
A | | XEROX | 0.84 | 0.84 | 0.84
A | | WA | 0.61 | 0.59 | 0.60
A | | SA | 0.58 | 0.63 | 0.61
N | 1055 | S1 | 0.31 | 0.30 | 0.30
N | | S2 | 0.22 | 0.21 | 0.21
N | | S3 | 0.70 | 0.68 | 0.68
N | | S4 | 0.61 | 0.76 | 0.65
N | | XEROX | 0.78 | 0.76 | 0.76
N | | WA | 0.62 | 0.59 | 0.60
N | | SA | 0.53 | 0.57 | 0.55
V | 1501 | S1 | 0.21 | 0.20 | 0.21
V | | S2 | 0.08 | 0.08 | 0.08
V | | S3 | 0.47 | 0.42 | 0.44
V | | S4 | 0.58 | 0.67 | 0.58
V | | XEROX | 0.72 | 0.62 | 0.65
V | | WA | 0.46 | 0.43 | 0.45
V | | SA | 0.46 | 0.48 | 0.47
All | 3723 | S1 | 0.31 | 0.30 | 0.30
All | | S2 | 0.19 | 0.19 | 0.19
All | | S3 | 0.58 | 0.56 | 0.57
All | | S4 | 0.60 | 0.73 | 0.63
All | | XEROX | 0.77 | 0.73 | 0.74
All | | WA | 0.55 | 0.53 | 0.54
All | | SA | 0.52 | 0.55 | 0.54

Table 10: The ARCADE evaluation data together with those of word_align (WA) and seq_align (SA)

As already mentioned above (in Section 2.3), the Xerox method uses language-specific syntactic knowledge to identify MWU candidates (as do most alignment methods). However, it seems that the principles of the extension of word_align to seq_align, i.e. estimating the model's parameters for all valid sequences while considering length relations, are applicable to Hiemstra's model as well. Extending Hiemstra's algorithm in this way could yield a new, robust, parsing-independent algorithm for single- and multi-word alignment.

4.4.2 Evaluation of word_align vs. seq_align

As the main goal of this research was to extend the word_align algorithm so that it could handle multi-word units without using any syntactic parsing, a specific comparison between the two algorithms in this respect is obviously required. As established above, the interest in MWUs lies primarily in the alignment of specialised terminology. Due to time and manpower constraints, it was not feasible to generate a representative sample of connections that would test the performance of the two algorithms on terms of various frequencies. Therefore, we had to make do with a less ideal sample, which we derived from the ARCADE sample by manually selecting connections where the French part was a multi-word term. Most of the terms in the resulting sample have very low frequencies (below 5 occurrences), a property which is not characteristic of principal terms in a domain-specific corpus. In addition, some of the terms were translated inconsistently (as seen in the table below), which is also atypical for terminology. Nevertheless, the results for this sample still reflect, to some extent, the relative levels of the judged methods.

The Termight method (Dagan & Church, 1994, 1997; see Section 2.3) is a simple way to extend single-word to multi-word alignment without parsing both texts. However, monolingual parsing is applied in order to identify the boundaries of the terms in one of the languages, which then makes it possible to build a glossary and a corresponding concordance. The situation in machine-aided human translation (as described in Section 4.1) is such that the translator usually seeks the translation of a specific term in the source language. Thus, it is unnecessary for the system to identify the boundaries of that term fully automatically. Nonetheless, if parsing is not available, it is impossible to prepare a glossary of a reasonable size in advance.

In order to compare word_align with seq_align fairly, we decided to try them in the Termight context, where word_align is expected to yield its best results. Hence, the English counterparts were defined as the concatenation of all words residing between the leftmost and rightmost alignments of the given French words.
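A minimal Python sketch of this Termight-style extraction follows; the data layout (token lists plus a dictionary of single-word alignments) is an assumption of the illustration.

    def english_counterpart(french_positions, alignment, english_tokens):
        """french_positions: token positions of the French term;
        alignment: dict mapping French token positions to English ones,
        as produced by a single-word aligner such as word_align."""
        targets = [alignment[i] for i in french_positions if i in alignment]
        if not targets:
            return []                       # nothing aligned: leave the answer blank
        left, right = min(targets), max(targets)
        return english_tokens[left:right + 1]   # all words in between, inclusive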
As mentioned above (in Subsection 4.4.1), the original ARCADE reference list indicated only one of the expression's words, expecting the annotator/system to identify the rest. However, since we also had the human annotators' results, we could replace the single words with the entire expressions as determined by the judges. For simplicity, we took only connections where there was full agreement between the two annotators on both the French and the English parts.

The sample and the corresponding results are presented in Table 11. The average precision, recall and F rates are shown in Table 12. The results resemble the ARCADE results in the numeric aspect as well as in the trend they indicate (i.e. word_align's advantage in precision and seq_align's advantage in recall). A closer look at the detailed results reveals that a significant part of seq_align's precision problem is related to its tendency to accompany the translation with some surrounding words. In most cases, these additional words are those typically found around the translating unit. It is to be expected that if an expression appears in many different contexts and is translated consistently, the level of noise produced by the algorithm will decrease (as happens for the more frequent terms such as compagnie aérienne, économie d'énergie etc.).

It should be noted that the Termight method cannot be fully successful in handling cases of one-to-many alignments, because word_align's output suggests only a single source word for each target word. Nonetheless, many-to-one and many-to-many alignments are workable for both algorithms, though seq_align is expected to do better in cases of non-literal translations.

In light of the similar quality of the alignments generated by the two algorithms, it is important to stress that the main advantage of seq_align over word_align is related to the dictionary. While word_align's dictionary gives the translations of single words only, seq_align's dictionary is a single- and multi-word dictionary. Word_align can provide a multi-word glossary only if syntactic parsing is applied to the single-word alignment (as done by Termight (Dagan & Church, 1994, 1997)). In contrast, seq_align manages to induce a high-quality bilingual glossary directly from the text, without using any syntactic knowledge. Recall that a multi-word glossary, which supplies term translations to its user, is also necessary for efficient retrieval of alignment examples. Hence, parsing-independent compilation of glossaries makes it possible to provide useful translation aids for many pairs of languages, even where reliable parsing is not available.
French réserves biologiques lignes à haute tension haute priorité haute performance haute performance haute performance word_align English Prec 0.00 English (Reference) habitat Rec 0.00 English seq_align Prec 0.00 Rec 0.00 power lines lines 1.00 0.50 lines 1.00 0.50 high priority high performance high performance high performance high priority high performance high performance high performance 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.67 1.00 haute performance haute performance haute technicité haute mer high performance high performance high technology sea going high mance high mance perfor- 1.00 1.00 1.00 1.00 perfor- 1.00 1.00 1.00 1.00 0.00 0.18 0.00 1.00 0.00 1.00 0.00 0.50 haute technologie ligne à haute tension haute tension haute technologie high technology sea going vessel in the first half of 1993 to patrol high technology high priority high performance high performance high performance computing high performance high performance to operate with sea 1.00 1.00 high technology 1.00 1.00 power line line 1.00 0.50 line 1.00 0.50 high voltage high technology high voltage high technology 1.00 1.00 1.00 1.00 high high technology 1.00 1.00 0.50 1.00 54 French haute pression haute tension haute technologie télévision à haute définition télévision à haute définition Télévision à haute définition centre historique centre historique English (Reference) high pressure high voltage word_align English Prec high pressure 1.00 voltage 1.00 Rec 1.00 0.50 high technology high technology 1.00 1.00 seq_align English Prec high pressure 1.00 high voltage 0.22 supply market is created between the entities high technology 1.00 0.00 0.00 HDTV 1.00 1.00 HDTV Rec 1.00 1.00 1.00 HDTV HDTV 1.00 1.00 HDTV in 0.50 1.00 HDTV HDTV 1.00 1.00 HDTV in 0.50 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.23 1.00 full time full time full time full time time and ( b ) full full time Werke full time 1.00 1.00 1.00 1.00 0.33 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.29 0.50 1.00 1.00 1.00 1.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 1.00 1.00 1.00 old town quarters old town quarters Community funding for studies connected with restoration of Palermo 's old town quarters time full time full time full time time and ( b ) full time full time plein temps temps plein temps plein temps plein temps plein full time full time full time full time full time temps plein plein gré temps plein plein fouet full time willingly full time hard hit temps plein enseignement secondaire enseignement secondaire enseignement secondaire enseignement secondaire enseignement secondaire effets secondaires effets secondaires full time secondary school secondary school secondary school secondary school secondary school side effects full time secondary 1.00 1.00 1.00 0.50 full time unfair competition from third countries full time secondary school secondary 1.00 0.50 secondary school 1.00 1.00 secondary 1.00 0.50 secondary school 1.00 1.00 secondary 1.00 0.50 secondary school 1.00 1.00 secondary 1.00 0.50 secondary school 1.00 1.00 effects 1.00 0.50 effects 1.00 0.50 side effects 0.17 1.00 effects 1.00 0.50 programmes secondaires programmes secondaires sensibles du point de vue des nitrates à coup sûr charge utile chefs d'entreprise sub schemes necrosis factor to be used , while avoiding the serious side effects sub 1.00 0.50 0.00 0.00 sub schemes sub 1.00 0.50 0.00 0.00 nitrate sensitive encouragement in nitrate sensitive 0.50 1.00 nitrate sensitive 1.00 1.00 0.00 0.00 0.00 0.00 0.00 
0.00 100 tonnes managers and 0.00 0.00 0.50 0.00 0.00 1.00 clear grounds payload managers 55 word_align English Prec entrepreneurs 1.00 Rec 1.00 English d'entre- English (Reference) entrepreneurs d'entre- entrepreneurs entrepreneurs 1.00 1.00 d'entre- entrepreneurs entrepreneurs 1.00 1.00 d'entre- Employers Employers 1.00 1.00 female entrepreneurs female entrepreneurs The Confederation of Galician Employers chefs d'entreprises chefs d'accusation chefs d'inculpation chefs d'entreprise chefs d'entreprise chefs d'entreprise chefs d'entreprise chefs d'inculpation compagnie aérienne compagnie aérienne compagnie aérienne compagnie aérienne compagnie de transports aériens compagnie de transports aériens compagnie aérienne animaux de compagnie employers 0.00 charges charges compagnie aérienne compagnie aérienne constitution constitution constitution constitution constitution French chefs prise chefs prises chefs prises chefs prises Détention ventive Détention ventive détention ventive détention ventive seq_align Prec 0.00 Rec 0.00 0.50 1.00 0.50 1.00 0.20 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 1.00 businessmen businessmen 1.00 1.00 property in Istanbul of businessmen businessmen businessmen 1.00 1.00 of businessmen 0.50 1.00 businessmen businessmen 1.00 1.00 of businessmen 0.50 1.00 entrepreneurs entrepreneurs 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 charges airline airline 1.00 1.00 infringement of the rights to airline airline airline 1.00 1.00 airline 1.00 1.00 airline airline 1.00 1.00 airline 1.00 1.00 airline airline 1.00 1.00 0.50 1.00 airline Turkish national airline 0.33 1.00 airline authorities Turkish national airline 0.33 1.00 airline Turkish national airline 0.33 1.00 Turkish national airline 0.33 1.00 airline airline 1.00 1.00 airline 1.00 1.00 pets 0.17 1.00 1.00 1.00 1.00 pets under CITES , these animals airline 0.17 airline pets under CITES , these animals airline 1.00 1.00 airline airline 1.00 1.00 airline 1.00 1.00 setting up setting up building up setting up setting up 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 pré- held on remand 0.00 0.00 0.00 0.00 pré- held on remand 0.00 0.00 Editor of 0.00 0.00 pré- held on remand 0.00 0.00 held on remand 1.00 1.00 pré- held on remand 0.00 0.00 trade unionists 0.00 0.00 Voutsas a computerized undertakings and groups of undertakings 56 French économies d'énergie économies d'énergie économies d'énergie économies d'énergie économies d'énergie formations formations formations phase de lancement lancement lancement English (Reference) energy saving word_align English Prec saving 1.00 Rec 0.50 seq_align English Prec energy saving 1.00 Rec 1.00 energy saving saving 1.00 0.50 energy saving 1.00 1.00 energy saving saving 1.00 0.50 0.67 1.00 energy saving saving 1.00 0.50 0.67 1.00 energy saving saving 1.00 0.50 energy saving projects energy saving projects energy saving 1.00 1.00 training courses training courses training courses start up phase training training phase 0.00 1.00 1.00 1.00 0.00 0.50 0.50 0.33 training courses training start up phase 0.00 1.00 1.00 1.00 0.00 1.00 0.50 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.30 1.00 0.00 0.00 0.00 0.00 0.50 1.00 start up starting up technology and know how to start up phase phase de lancement phase de lancement start up phase phase 1.00 0.33 start up phase phase ( 1992 96 0.33 0.33 lancement organes de presse organes de presse organes de presse organes institutionnels organes génitaux organes de 
presse passage en revue passage frontière marche à pied cultures sur pied starting up newspapers 0.00 0.00 0.00 0.00 publications 0.00 0.00 titles 0.00 0.00 0.00 0.00 institutions 0.00 0.00 0.00 0.00 march crops marche à pied marche à pied marche à pied station balnéaire station éolienne suspension taxes tarifs vols start up phase and in anticipation of a main phase new publications genitals limbs 0.00 0.00 limbs 0.00 0.00 press press 1.00 1.00 press 1.00 1.00 review 0.00 0.00 0.00 0.00 border crossing 0.00 0.00 0.00 0.00 Blast crops 0.00 1.00 0.00 1.00 0.00 0.17 0.00 1.00 walking walking walking tourist beach wind farm walking walking 1.00 1.00 0.00 0.00 0.50 1.00 1.00 0.00 0.00 0.50 0.00 0.00 0.00 0.00 0.50 0.00 0.00 0.00 0.00 1.00 tax suspension air fare tax suspension fares have suddenly 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 Jandía wind Blast and compensate the farmers whose crops cycling and be played by cycling and the Jandía wind farm tax suspension fares have suddenly

Table 11: Evaluation sample of term alignment by word_align and seq_align

Algorithm | Prec | Rec | F
word_align | 0.59 | 0.54 | 0.56
seq_align | 0.51 | 0.60 | 0.55

Table 12: Average quality of term alignment by word_align and seq_align

5 CONCLUSIONS AND FUTURE WORK

One of the problematic issues in bilingual terminology extraction and detailed alignment has been the identification of term boundaries, which is a pre-condition for the compilation of a term-level bilingual glossary and a corresponding concordance. Most authors have solved the problem by applying language-specific monolingual syntactic analysis to at least one of the text halves. This approach has yielded very impressive results, but could not provide a generic, language-independent solution to the problem.

The seq_align algorithm, presented in the current dissertation, was initiated in order to supply such a general solution, especially for cases where high-quality parsing is not available. For this purpose, we took the well-known word_align algorithm (Dagan, Church & Gale, 1993) as a basic model to be developed into a syntax-independent algorithm for the treatment of multi-word units (MWUs). Unlike word_align, which requires monolingual syntactic analysis in order to compile a bilingual multi-word glossary (as done by Termight (Dagan & Church, 1994, 1997)), the seq_align method makes no presumptions about the availability of such knowledge.

As the evaluation of the bilingual dictionary shows, a useful glossary can be induced regardless of syntactic considerations. When applied to the output of the EM process, the noise-cleaning algorithm yields a comprehensive and precise glossary of the principal terms appearing in the given text, using statistical data only. This glossary also indicates the exact location of potential breaks in the target compounds, which is very helpful information for a translator.

Both word_align and seq_align have achieved reasonable quality of detailed alignment. Though assigned approximately the same average grade, it has turned out that word_align is slightly advantageous in terms of precision, whereas seq_align is favourable in terms of recall. It should be noted that the aggregate equivalence of the two algorithms has been reached despite the much greater number of candidates considered by seq_align, which could have had a serious effect on the quality of its output.
Recall, however, that this achievement is strongly related to the removal of affix stop words from the source entries of the inverted dictionary (see Subsection 3.3.2).

The results of the comparison of word_align and seq_align with other alignment systems on the basis of the ARCADE project data suggest the superiority of Hiemstra's model (Hiemstra, 1996), used by the XEROX system, over IBM's Model 2, used by our algorithms. Seemingly, this supports the claim raised by several authors that a non-directional model represents the relations between parallel texts better than a directional one. In any case, the basic idea of seq_align, i.e. estimating the model's parameters for all valid sequences while considering length relations, is applicable to Hiemstra's model too. Such an extension of Hiemstra's method is likely to provide reliable MWU alignment for many language pairs without using any syntactic knowledge.

REFERENCES

Ahrenberg, L., Andersson, M. & Merkel, M. (1998). A Simple Hybrid Aligner for Generating Lexical Correspondences in Parallel Texts. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montréal, Canada, 10–14 August 1998, 9–35.

Ahrenberg, L., Andersson, M. & Merkel, M. (2000). A knowledge-lite approach to word alignment. Parallel Text Processing (Véronis, J., Ed.), 97–116. Dordrecht, Kluwer Academic Publishers.

Baum, L. E. (1972). An inequality and an associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3, 1–8.

Brown, P. F., Della Pietra, S., Della Pietra, V. J. & Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19 (2), 263–311.

Brown, P. F., Lai, J. C. & Mercer, R. L. (1991). Aligning Sentences in Parallel Corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, 169–176.

Bourigault, D. (1992). Surface grammatical analysis for the extraction of terminological noun phrases. Proceedings of the 14th International Conference on Computational Linguistics (COLING'92), Nantes, France, 977–981.

Choueka, Y., Conley, E. S. & Dagan, I. (2000). A comprehensive bilingual word alignment system: application to disparate languages: Hebrew and English. Parallel Text Processing (Véronis, J., Ed.), 69–96. Dordrecht, Kluwer Academic Publishers.

Church, K. W. & Gale, W. A. (1991). Concordances for Parallel Text. In Using Corpora: Proceedings of the Eighth Annual Conference of the UW Centre for the New OED and Text Research (Oxford, September 29 – October 1, 1991), 40–62.

Cover, T. M. & Thomas, J. A. (1991). Elements of Information Theory. New York: John Wiley & Sons, Inc.

Dagan, I. & Church, K. W. (1994). Termight: Identifying and translating technical terminology. Proceedings of the 4th Conference on Applied Natural Language Processing, 34–40.

Dagan, I. & Church, K. W. (1997). Termight: Coordinating humans and machines in bilingual terminology acquisition. Machine Translation, 12 (1–2), 89–107.

Dagan, I., Church, K. W. & Gale, W. A. (1993). Robust bilingual word alignment for machine aided translation. Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus, Ohio, 1–8.

Daille, B. (1994). Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques. Unpublished doctoral dissertation, Université de Paris VII.
Daille, B., Gaussier, E. & Langé, J.-M. (1994). Towards automatic extraction of monolingual and bilingual terminology. Proceedings of the 15th International Conference on Computational Linguistics (COLING'94), Kyoto, Japan, 712–716.

Debili, F. & Sammouda, E. (1992). Appariement des Phrases de Textes Bilingues. Proceedings of the 14th International Conference on Computational Linguistics (COLING'92), Nantes, France, 517–538.

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39 (B), 1–38.

Fung, P. & Church, K. W. (1994). K-vec: A New Approach for Aligning Parallel Texts. Proceedings of the 15th International Conference on Computational Linguistics, Kyoto, 1096–1102.

Gale, W. A. & Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19 (3), 75–102.

Gaussier, É., Hull, D. & Aït-Mokhtar, S. (2000). Term alignment in use: Machine-aided human translation. Parallel Text Processing (Véronis, J., Ed.), 253–274. Dordrecht, Kluwer Academic Publishers.

Haruno, M. & Yamazaki, T. (1996). High-performance bilingual text alignment using statistical and dictionary information. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL'96), Santa Cruz, California, 131–138.

Haruno, M. & Yamazaki, T. (1997). High-performance bilingual text alignment using statistical and dictionary information. Journal of Natural Language Engineering, 3 (1), 1–14.

Hiemstra, D. (1996). Using Statistical Methods to Create a Bilingual Dictionary. Unpublished Master's thesis, Universiteit Twente.

Jacquemin, C. (1991). Transformation des noms composés. Unpublished doctoral dissertation, Université de Paris VII.

Johansson, S., Ebeling, J. & Hofland, K. (1996). Coding and aligning the English-Norwegian parallel corpus. In Aijmer, K., Altenberg, B. & Johansson, M. (Eds.), Languages in Contrast (Papers from a Symposium on Text-based Cross-linguistic Studies, 4–5 March 1994, pp. 85–112). Lund: Lund University Press.

Kay, M. & Röscheisen, M. (1988). Text-translation alignment. Technical Report. Xerox Palo Alto Research Center.

Kay, M. & Röscheisen, M. (1993). Text-translation alignment. Computational Linguistics, 19 (1), 121–142.

McEnery, A. M., Langé, J.-M., Oakes, M. P. & Véronis, J. (1997). The exploitation of multilingual annotated corpora for term extraction. Corpus Annotation: Linguistic Information from Computer Text Corpora (Garside, R., Leech, G. & McEnery, A. M., Eds.), 220–230. London, Addison Wesley Longman.

Melamed, I. D. (1996). Automatic detection of omissions in translations. Proceedings of the 16th International Conference on Computational Linguistics (COLING'96), Copenhagen, 764–769.

Melamed, I. D. (1997a). Automatic discovery of non-compositional compounds in parallel data. Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP'97), Providence, RI, 97–108.

Melamed, I. D. (1997b). A word-to-word model of translational equivalence. Proceedings of the 35th Conference of the Association for Computational Linguistics (ACL'97), Madrid, 490–497.

Simard, M., Foster, G. F. & Isabelle, P. (1992). Using cognates to align sentences in bilingual corpora. Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Montréal, Canada, 25–27 June 1992, 67–81.

Smadja, F. A. (1992). How to Compile a Bilingual Collocational Lexicon Automatically. Proceedings of the AAAI Workshop on Statistically-Based NLP Techniques.
Smadja, F. A. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19 (1), 143–177.

Smadja, F. A., McKeown, K. R. & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: a statistical approach. Computational Linguistics, 22 (1), 1–38.

Véronis, J. & Langlais, P. (2000). Evaluation of parallel text alignment systems: The ARCADE project. Parallel Text Processing (Véronis, J., Ed.), 369–388. Dordrecht, Kluwer Academic Publishers.