Journal of Chinese Language and Computing 16 (4): 225-237

A New Dictionary-based Word Alignment Algorithm

Zhiming Xu*†, Jonathan J. Webster* and Chunyu Kit*
* Department of Chinese, Translation and Linguistics, City University of Hong Kong, Tat Chee Ave., Kowloon, Hong Kong
† School of Computer Science and Technology, Harbin Institute of Technology, Box 321, Harbin 150001, P. R. China
xuzm@hit.edu.cn  ctjjw@cityu.edu.hk  ctckit@cityu.edu.hk

Abstract

This paper presents our recent work on dictionary-based word alignment for the English-Chinese bilingual corpus of Hong Kong legal texts. It is intended to address a number of critical issues involved in word alignment, including monolingual pre-processing for identifying text items for alignment, proper formulation of a similarity measure for bilingual word pairs, and a novel treatment of many-to-many word alignment. A dictionary-based word alignment algorithm (DictAlign) is proposed and tested on the bilingual texts of Hong Kong laws. Experimental results show that this approach achieves an impressive performance: 95.70% precision and a 68.39% coverage rate for English, and 95.73% precision and a 64.87% coverage rate for Chinese.

Keywords: word alignment, word similarity

1. Introduction

The main task of word alignment is to identify translation equivalents between lexical items in bilingual parallel texts (or bitexts). In addition to individual words, i.e., single-word units, the lexical items to be aligned also include multi-word units, such as compound words, terms, idiomatic expressions and some text chunks of other types. The lexical items for alignment are referred to as link units (Ahrenberg et al., 2000; Tiedemann, 1999). Bilingual word alignment usually assumes a segment-aligned bitext as input, where bitext segments may be mutual translations of clauses, sentences, and sentence sequences (Tiedemann, 1999).

In general, word alignment work involves the following typical tasks: (1) acquisition of necessary resources, including bilingual dictionaries, bilingual terminology and monolingual stop (or functional) word lists; (2) segmentation of monolingual sentences into link units; and (3) alignment of bilingual link units in a given bitext. Word alignment systems may aim at different purposes, e.g., full-text word alignment vs. bilingual lexicon extraction (Ahrenberg et al., 2000). Our recent work, as part of an ongoing research project on EBMT for Hong Kong legal texts, is aimed at full-text word alignment. The outcomes of this work are expected to facilitate text alignment at higher linguistic levels, such as phrase/chunk alignment and sentence alignment, and translation selection (Och and Ney, 2003; Knight, 1999; Wu, 1996).

Our word alignment work addresses both single-word and multi-word link units. Multi-word link units sometimes appear in the corpus in discontinuous form, with their parts separated by determiners, adjectives and other intervening words, so it is not surprising that they are segmented into fragments. Moreover, because of imperfect segmentation of link units, mostly due to segmentation ambiguities, even continuous multi-word units may be segmented into fragments. Problems of this kind make the alignment of such link units more complicated than one-to-one word alignment.
Once a pair of multi-word link units are both segmented into several fragments, the many-to-many alignment problem arises, although alignment of the other types, including one-to-one, one-to-many and many-to-one alignment, is more common in practice. In this paper we will present our recent work on (1) the segmentation of English and Chinese monolingual Hong Kong legal texts into link units and (2) the word alignment of these bilingual link units.

The remaining sections of the paper are organized as follows. Section 2 introduces our probabilistic approach to link unit segmentation. Section 3 formulates an empirical similarity measure for identifying word pairs of mutual translation with the aid of an available bilingual dictionary. Section 4 proposes a dictionary-based word alignment algorithm, called DictAlign, which copes with all types of alignment of discontinuous link units. Section 5 presents our experimental results on the BLIS corpus, a complete collection of English-Chinese bilingual texts of Hong Kong laws, using this algorithm. It achieves an alignment precision above 95%, with coverage rates of 64.87% and 68.39%, respectively, for the Chinese and English texts in the input bilingual corpus. Section 6 concludes the paper and discusses possible directions for future work.

2. Identification of monolingual link units

2.1 Pre-processing

The proper segmentation of monolingual texts into link units is essential to word alignment if we intend to extend word alignment to handle multi-word link units in addition to single-word link units. Two typical pre-processing tasks are involved in this stage of work for the identification of such monolingual link units, namely, lemmatization and tokenization.

In our work, lemmatization takes place during the pre-processing of English texts, in order to convert variant word forms back to their lemmas, or base forms. Our lemmatization module exploits a number of morphological rules to carry out the conversion for regular nouns and verbs. Irregular nouns and verbs are handled individually with the aid of a mapping table of irregular word forms and their lemmas.

Tokenization is a process to recognize text tokens for further processing. Tokens can be viewed as atomic text units that form larger text units, such as words, for later processing; they are not further decomposed into smaller units. In our work, a token in Chinese is a Chinese character of fixed length (2 bytes); note that the tokenization here for Chinese does not involve word segmentation. A token in English is a single word, which is generally delimited by spaces and punctuation marks. A link unit in either language is a sequence of one or more tokens. For example, in the pair of link units "香港" / "Hong Kong", "香港" consists of the two tokens "香" and "港", and "Hong Kong" of the two tokens "Hong" and "Kong".

2.2 Statistical approach to link unit identification

After the pre-processing phase, each sentence S in an input text T is converted into a token sequence S = t_1 t_2 ... t_n = t_1^n. The task of statistical link unit segmentation is to find a link unit sequence S = u_1 u_2 ... u_m = u_1^m over this token sequence that maximizes the probability p(u_1^m). We use a link unit-based unigram as the language model and define

    p(u_1^m) = \prod_{i=1}^{m} p(u_i),  where  p(u_i) = n(u_i) / \sum_{u \in T} n(u)

and n(u_i) is the frequency of the link unit u_i in T.
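To make the decoding step concrete, the following is a minimal sketch, under our own simplifying assumptions, of how such a unigram link-unit model can be applied to a token sequence by dynamic programming; the function and parameter names (segment_into_link_units, unigram_prob, max_unit_len, oov_prob) are hypothetical and not part of the original system, and the estimation of the unigram probabilities themselves is described next.

```python
import math

def segment_into_link_units(tokens, unigram_prob, max_unit_len=5, oov_prob=1e-8, sep=" "):
    """Find the link unit sequence u_1 ... u_m over a token sequence that
    maximizes prod_i p(u_i), by dynamic programming over token positions.
    Use sep="" for Chinese, where tokens are single characters."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)   # best[i]: max log-probability of a segmentation of tokens[:i]
    back = [0] * (n + 1)               # back[i]: start position of the last unit ending at i
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_unit_len), i):
            unit = sep.join(tokens[j:i])
            score = best[j] + math.log(unigram_prob.get(unit, oov_prob))
            if score > best[i]:
                best[i], back[i] = score, j
    units, i = [], n                   # recover the best segmentation from the back-pointers
    while i > 0:
        units.append(sep.join(tokens[back[i]:i]))
        i = back[i]
    return list(reversed(units))
```

For instance, a call such as segment_into_link_units("make general provision with regard thereto".split(), p) could return ["make general provision", "with", "regard", "thereto"] if the unigram model assigns the multi-word unit a sufficiently high probability.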
The language model is trained on the unsegmented corpus T using the EM algorithm (Dempster et al., 1977). The training process is similar to the EM training presented in Kit et al. (2003) for probabilistic Chinese word segmentation. We train unigram models for English and Chinese using the BLIS corpus, and utilize them respectively to identify link units in each language.

3. Bilingual word similarity

A machine-readable bilingual dictionary is the most critical lexical resource in our dictionary-based approach to word alignment. In general, word alignment approaches can be categorized into two types: statistics-based, as illustrated, for example, in Brown et al. (1993) and Melamed (2000), and dictionary-based, as we practice here. We expect that a dictionary-based word alignment approach can achieve a better performance than statistical word alignment methods, because the lexical resources it makes use of are more reliable than pure statistical information. The lexical resources that DictAlign uses include a Chinese-English dictionary of frequent words and a bilingual term list.

Given a bilingual sentence pair (S, T), each a sequence of monolingual link units, the task of word alignment is to identify as many corresponding pairs of link units as possible in the sentence pair. For each link unit s in the source-language sentence S, we want to find its counterpart in the target sentence T. This counterpart of s is called its in-context translation (ICT), denoted as ict(s). On the other hand, its translation as extracted from an available bilingual dictionary is called its dictionary translation (DT), denoted as dt(s) (Ker and Chang, 1997).

For any bilingual link unit pair (s, t) in (S, T), we need to estimate the likelihood that (s, t) is a pair of mutual translations. We can estimate such likelihood in terms of their bilingual word similarity, denoted as sim(s, t). Statistical measures, such as mutual information, the Dice coefficient and other statistical metrics, are commonly used to estimate the statistical association strength or distributional similarity of a word pair. Here we take a dictionary-based approach to estimate the string similarity between ict(s) and t for the purpose of determining whether (s, t) are mutual translations. String similarity can be computed in terms of the longest common subsequence ratio (LCSR), as illustrated in Tiedemann (1999) for the purpose of determining whether a word pair is a cognate pair. Here, we also resort to LCSR for estimating the string similarity between ict(s) and t. With regard to the symmetry of word alignment, we define the bilingual word similarity of a link unit pair (s, t) as follows:

    sim(s, t) = ( LCSR(s, dt(t)) + LCSR(dt(s), t) ) / 2                    (1)

where LCSR(x, y) = length(LCS(x, y)) / max(length(x), length(y)), and LCS(x, y) denotes the longest common subsequence of x and y. sim(s, t) gives a score indicating the likelihood that (s, t) is a pair of mutual translations. sim(s, t) is symmetrical, i.e., sim(s, t) = sim(t, s). To avoid unexpected noise in word alignment, if the text strings in the pair (s, dt(t)) or (dt(s), t) do not include one another, then we set sim(s, t) = 0.
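For illustration, here is a small sketch of how the LCSR and the similarity of Equation (1) might be computed. Here dt is assumed to be a function returning a single dictionary translation string, and the containment filter reflects one possible reading of the noise rule above; the names lcs_length, lcsr and word_similarity are ours, not the original implementation's.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of strings x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cx == cy else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def lcsr(x, y):
    """Longest common subsequence ratio: |LCS(x, y)| / max(|x|, |y|)."""
    return lcs_length(x, y) / max(len(x), len(y)) if x and y else 0.0

def contains(x, y):
    """True if one string includes the other."""
    return x in y or y in x

def word_similarity(s, t, dt):
    """Equation (1): the average of LCSR(s, dt(t)) and LCSR(dt(s), t).
    The filter below is one reading of the containment rule in the text."""
    dt_s, dt_t = dt(s), dt(t)     # dictionary translations, assumed to be single strings here
    if not (contains(s, dt_t) and contains(dt_s, t)):
        return 0.0
    return (lcsr(s, dt_t) + lcsr(dt_s, t)) / 2.0
```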
4. The DictAlign algorithm

[Figure 1. Direct link digraph of word alignment: source link units s1, ..., s6 linked to target link units t1, ..., t6]

Figure 1 shows a bilingual sentence pair (S, T), where S = {s_1, s_2, ..., s_m} and T = {t_1, t_2, ..., t_n}, each being a set of content-word link units obtained after the monolingual link unit segmentation phase and the filtering out of functional words. DictAlign, our dictionary-based word alignment algorithm, is designed to find as many proper pairs of link units as possible in the product set S × T.

4.1 Baseline algorithm of word alignment

In order to evaluate the performance of the DictAlign algorithm by comparison, we first implement a one-to-one word alignment algorithm as a baseline, using the Dice coefficient as the similarity measure. It is given as follows. Given a pair of sentences (S, T), each a sequence of content-word link units, S = {s_1, s_2, ..., s_m} and T = {t_1, t_2, ..., t_n}, the similarity sim(s, t) for each link unit pair (s, t) in S × T is first computed.

(1) Select the link unit pair (s_i, t_j) = argmax_{(s, t) ∈ S × T} sim(s, t), provided sim(s_i, t_j) > 0. If no such pair is available, then exit.
(2) Align (s_i, t_j), and let S = S - {s_i}, T = T - {t_j}.
(3) Go to (1).

4.2 DictAlign word alignment algorithm

The bilingual dictionary used in our research subsumes a bilingual glossary of Hong Kong laws, with many multi-word terms that sometimes appear in a discontinuous form in the corpus. It is also quite common that even the continuous ones are segmented into several fragments by the process of link unit segmentation, mostly due to segmentation ambiguities. If a multi-word unit and its ICT are both in fragments, we have a many-to-many mapping problem in our alignment work. If only one of the two counterparts is in fragments, we have a one-to-many (or many-to-one) mapping problem, a special case of many-to-many mapping. According to our observation, such special cases take place more often than many-to-many alignment, which occurs only occasionally.

In order to decide whether two given sets of such fragments, set_1 and set_2, should be aligned with each other, we need to calculate their similarity. Assuming set_1 = {s_{m_1}, s_{m_2}, ..., s_{m_k}} and set_2 = {t_{n_1}, t_{n_2}, ..., t_{n_l}}, each being a sequence of segmented fragments, we define their similarity as follows:

    sim(set_1, set_2) = sim(s_{m_1} ∘ s_{m_2} ∘ ... ∘ s_{m_k}, t_{n_1} ∘ t_{n_2} ∘ ... ∘ t_{n_l})        (2)

where ∘ denotes string concatenation. The computation involved in (2) is identical to that for (1), except for the concatenation before the computation. It is worth noting, however, that this formula is effective in handling fragmental in-vocabulary (IV) link units, not out-of-vocabulary (OOV) ones. Given a candidate pair of fragmental link units, if one is IV in its language and the other is OOV in the other language, they may only be partially aligned using (2).

Firstly, to align fragmental link units, we need to construct a map set map(u) for each link unit u in S or T. For each s_i ∈ S, map(s_i) = {t ∈ T | ∀s ∈ S, sim(s_i, t) ≥ sim(s, t)}; for each t_j ∈ T, map(t_j) = {s ∈ S | ∀t ∈ T, sim(s, t_j) ≥ sim(s, t)}. That is, a link unit t in map(s) takes s as the most probable ict(t). Their relation is denoted by a direct link t → s, as in Figure 1, and the length of t → s is defined as sim(t, s).
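As a rough sketch of these two ingredients, the fragment-set similarity of Equation (2) and the map-set construction might be implemented as follows; sim stands for the pairwise similarity of Equation (1) applied to arbitrary strings, the joining convention for fragments is our own assumption, and the names set_similarity and build_map_sets are hypothetical.

```python
def set_similarity(set1, set2, sim, sep=""):
    """Equation (2): similarity of two fragment combinations, computed by
    concatenating the fragments on each side before applying sim."""
    return sim(sep.join(set1), sep.join(set2))

def build_map_sets(S, T, sim):
    """map_s[s]: target units whose most similar source counterpart is s;
       map_t[t]: source units whose most similar target counterpart is t."""
    map_s = {s: [t for t in T if all(sim(s, t) >= sim(s2, t) for s2 in S)] for s in S}
    map_t = {t: [s for s in S if all(sim(s, t) >= sim(s, t2) for t2 in T)] for t in T}
    return map_s, map_t
```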
Once the map set of each link unit in S or T is constructed, we select the longest direct link s_i → t_j among all the direct links in S × T, with sim(s_i, t_j) > 0. If no such direct link is available, we are done; otherwise, for the selected direct link s_i → t_j, we seek the pair of elements (set_1, set_2) = argmax_{(x, y) ∈ 2^{map(t_j)} × 2^{map(s_i)}} sim(x, y), where 2^{map(u)} denotes the power set of map(u). Then we align (set_1, set_2), and let S = S - set_1, T = T - set_2. Next we re-construct the map set of each unaligned link unit in S or T, and iterate the above alignment process until no more alignment is available.

Following this process, to tackle the many-to-many word alignment problem, the DictAlign algorithm is formulated as follows. Given a pair of sentences (S, T), each a sequence of content-word link units, S = {s_1, s_2, ..., s_m} and T = {t_1, t_2, ..., t_n}, the similarity sim(s, t) for each link unit pair (s, t) in S × T is first computed.

(1) For each link unit s_i ∈ S and t_j ∈ T, construct their map sets.
(2) Select the link unit pair (s_i, t_j) = argmax_{(s, t) ∈ S × T} sim(s, t), provided sim(s_i, t_j) > 0. If no such pair is available, then exit.
(3) Select the element pair (set_1, set_2) = argmax_{(x, y) ∈ 2^{map(t_j)} × 2^{map(s_i)}} sim(x, y).
(4) Align (set_1, set_2), and let S = S - set_1, T = T - set_2.
(5) Go to (1).

We give an example to illustrate how DictAlign works step by step, with the following pair of bilingual sentences.

English: Get the children up.
Chinese: 讓孩子們起床。

During the monolingual link unit identification phase, the English sentence may be segmented into a few fragments, "get / the / children / up / .", in which "get up" appears in discontinuous form. How can we recognize "get" and "up" as a single multi-word unit and align it with "起床"? Compared with a purely statistical approach, dictionary knowledge is more reliable. In this paper, we use dictionary knowledge to determine whether a group of fragments constitutes a multi-word unit, so that impossible fragment combinations can be filtered out. The pair of sentences is segmented into sequences of fragments:

English: Get / the / children / up / .
Chinese: 讓 / 孩子們 / 起床 / 。

These fragments generate a pair of link unit sets:

S = {"get", "the", "children", "up"}
T = {"讓", "孩子們", "起床"}

DictAlign aligns them through the following steps.

1. Using Equation (1), we compute the word similarity between every pair of link units; these similarity values form a similarity matrix on S × T.
2. Select a pair of link units with maximum word similarity. Here, we assume sim("起床", "up") is the maximum; since DictAlign is a best-first algorithm, it selects the pair "起床" and "up" to align first.
3. Construct the map set for each element of S and T by the definitions given above: for each s_i ∈ S, map(s_i) = {t ∈ T | ∀s ∈ S, sim(s_i, t) ≥ sim(s, t)}, and for each t_j ∈ T, map(t_j) = {s ∈ S | ∀t ∈ T, sim(s, t_j) ≥ sim(s, t)}. Each link unit t in map(s) takes s as the most probable ict(t). For example, map("起床") = {"get", "up"} and map("up") = {"起床"}.
4. Construct the power set 2^{map(u)} of each map set map(u); it forms the space of possible fragment combinations, from which we select a pair of fragment combinations to align. The power set of map("起床") is 2^{map("起床")} = {{"get"}, {"up"}, {"get", "up"}}, and the power set of map("up") is 2^{map("up")} = {{"起床"}}. We then select the link unit pair (u, v) from 2^{map("起床")} × 2^{map("up")} that maximizes sim(u, v); here, (u, v) = ({"get", "up"}, {"起床"}).
5. Finally, we align ({"get", "up"}, {"起床"}), move the aligned units into the alignment result, and iterate the above process until no more alignment is available.
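Putting the pieces together, a minimal sketch of the DictAlign loop might look as follows. It reuses the hypothetical build_map_sets helper sketched above, takes sim and set_sim (Equations 1 and 2) as function arguments, and follows the reading, as in the worked example, in which the source-side fragment combination is drawn from map(t_j) and the target-side one from map(s_i); it is an illustrative sketch rather than the original implementation.

```python
from itertools import combinations

def non_empty_subsets(units):
    """All non-empty subsets of a map set, i.e. the candidate fragment combinations."""
    return [list(c) for r in range(1, len(units) + 1) for c in combinations(units, r)]

def dict_align(S, T, sim, set_sim):
    """Best-first sketch of the DictAlign loop: pick the strongest direct link,
    expand it over the power sets of the two map sets, align the best pair of
    fragment combinations, remove the aligned units, and repeat."""
    S, T, alignments = list(S), list(T), []
    while True:
        map_s, map_t = build_map_sets(S, T, sim)            # helper sketched above
        candidates = [(s, t) for s in S for t in T if sim(s, t) > 0]
        if not candidates:
            return alignments
        s_i, t_j = max(candidates, key=lambda p: sim(*p))   # longest direct link
        set1, set2 = max(((x, y) for x in non_empty_subsets(map_t[t_j])
                                 for y in non_empty_subsets(map_s[s_i])),
                         key=lambda p: set_sim(*p))
        alignments.append((set1, set2))                     # source-side, target-side fragments
        S = [s for s in S if s not in set1]
        T = [t for t in T if t not in set2]
```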
5. Experimental results

This section presents our experimental results using the DictAlign algorithm on the English-Chinese bilingual texts of Hong Kong laws. The bilingual dictionary we use includes 134,497 bilingual word pairs and 25,214 bilingual legal term pairs. A stop list is used to filter out function (or non-content) words in each language. The training corpus consists of 238,271 aligned bilingual sub-paragraph pairs from the BLIS corpus, many of which are sub-paragraph legal text items. In total, 38MB of English text and 23MB of Chinese text are used, respectively, to train the unigram models for link unit segmentation of the two languages.

A sub-corpus of the first 205 sub-paragraph pairs is selected as test data to evaluate the performance of the DictAlign algorithm and the baseline word alignment method. Their performance is presented in Table 1 for comparison. We can see that DictAlign achieves a significantly higher precision than the baseline, although its coverage rate is lower. The majority of alignment errors made by DictAlign result from partial matching, whereas the majority of alignment errors made by the baseline are mismatches. The precision and coverage rate of word alignment are defined as follows.

    precision = (number of correctly aligned content link units in the test corpus) / (number of aligned content link units in the test corpus)        (3)

    coverage rate = (number of aligned content link units in the test corpus) / (number of content link units in the test corpus)        (4)

Table 1: The alignment performance: DictAlign vs. the baseline

                                          DictAlign              Baseline
                                      English    Chinese     English    Chinese
  Test corpus     Words                 4970       5506       (4970)     (5506)
                  Link units            1838       1950       (1838)     (1950)
  Experimental    Aligned link units    1257       1265        1747       1747
  results         Partial-matching        40         40          36         36
                  Mismatching             14         14         262        262
                  Coverage rate        68.39%     64.87%      95.05%     89.59%
                  Precision            95.70%     95.73%      83.00%     83.00%

The example given below illustrates how the link unit segmentation and the word alignment are carried out step by step.

(1) Input: a pair of un-segmented sub-paragraphs

<s id=1> 本條例旨在綜合和修訂有關法例的釋疑﹑適用範圍﹑釋義的法律﹐設立關於這些事宜的一般條文﹐對法例和公共文件中的詞語和詞句下定義﹐訂立關於公職人員﹑政府和公共機構合約﹑民事和刑事程序的一般條文﹐以及為由這些事附帶引起的或與這些事相關的目的和事宜訂立一般條文。

<s id=1> To consolidate and amend the law relating to the construction, application and interpretation of laws, to make general provisions with regard thereto, to define terms and expressions used in laws and public documents, to make general provision with regard to public officers, public contracts and civil and criminal proceedings and for purposes and for matters incidental thereto or connected therewith.

(2) Link unit segmentation result

<s id=1> |本條例|旨在|綜合|和|修訂|有關|法例|的|釋疑|﹑|適用範圍|﹑|釋義|的|法律|﹐|設立|關於|這些|事宜|的|一般條文|﹐|對|法例|和|公共文件|中的|詞語|和|詞句|下定義|﹐|訂立|關於|公職人員|﹑|政府和公共機構合約|﹑|民事|和|刑事程序|的|一般條文|﹐|以及|為|由|這些|事|附帶引起|的|或|與|這些|事|相關的|目的|和|事宜|訂立一般條文|。

<s id=1> |to|consolidate|and|amend|the|law|relating to|the|construction|,|application|and|interpretation|of|law|,|to|make general provision|with|regard|thereto|,|to|define|terms|and|expressions|used|in|law|and|public|documents|,|to|make general provision|with regard to|public|officers|,|public contract|and|civil|and|criminal proceedings|and|for|purposes|and|for|matter|incidental|thereto|or|connected|therewith|.
(3) Word alignment output from DictAlign

Line  Chinese link units (position)    English link units (position)    Type (m:n)
 1    [政府和公共機構合約(38)]           [public contract(41)]             1:1
 2    [訂立一般條文(61)]                [make general provision(36)]      1:1
 3    [訂立(17)] [一般條文(22)]         [make general provision(18)]      2:1
 4    [附帶引起(51)]                    [incidental(52)]                  1:1
 5    [適用範圍(11)]                    [application(11)]                 1:1
 6    [下定義(32)]                      [define(24)]                      1:1
 7    [民事(40)]                        [civil(43)]                       1:1
 8    [刑事程序(42)]                    [criminal proceedings(45)]        1:1
 9    [釋疑(9)]                         [construction(9)]                 1:1
10    [公共文件(27)]                    [public(32)]                      1:1
11    [釋義(13)]                        [interpretation(13)]              1:1
12    [詞句(31)]                        [expressions(27)]                 1:1
13    [事宜(20)]                        [matter(51)]                      1:1
14    [有關(6)]                         [relating to(7)]                  1:1
15    [公職人員(36)]                    [public(38)]                      1:1
16    [綜合(3)]                         [consolidate(2)]                  1:1
17    [法律(15)]                        [law(15)]                         1:1
18    [修訂(5)]                         [amend(4)]                        1:1
19    [法例(7)]                         [law(6)]                          1:1
20    [法例(25)]                        [law(30)]                         1:1

The above also exemplifies that the majority of alignments belong to the one-to-one (1:1) type, and that the aligned bilingual link units in the two languages are, in general, not too far away from each other. Line 3 presents a two-to-one (2:1) alignment, showing two discontinuous Chinese link units, "訂立 (17)" and "一般條文 (22)", aligned with one English link unit, "make general provision (18)". Line 10 shows an erroneous alignment, where "公共文件 (27)" is partially aligned with "public (32)". The original bilingual text shows that "公共文件 (27)" should be aligned with "public (32)" and "documents (33)". Errors of this type mostly result from inadequate lexical resources. For example, "public documents" is an OOV link unit that our dictionary-based approach is unable to cope with. Line 15 is another error caused by, again, inadequate lexical resources. At the price of such errors, however, this dictionary-based approach is particularly powerful in handling IV items, achieving a promising alignment precision.

6. Conclusion and future work

In this paper we have presented a dictionary-based approach to word alignment, including the monolingual link unit segmentation method, the formulation of an empirical measure for the similarity of candidate word pairs, the DictAlign algorithm for word alignment, and the experimental results on the English-Chinese bilingual corpus of Hong Kong laws. The research work reported here is part of an ongoing EBMT project whose current phase is focusing on the automatic acquisition of high-quality translation equivalents (or examples) through text alignment at various levels of granularity, including word, phrase (or chunk) and clause. A high alignment precision is demanded for the purpose of ensuring the quality of the examples so acquired. Our experiments have shown that the word similarity measure based on the longest common subsequence ratio results in a remarkable alignment precision. In addition, the DictAlign algorithm is language-independent and can be applied to other language pairs.

Our future work will focus on integrating the DictAlign algorithm with a statistical word alignment method for performance improvement, to improve its coverage rate while maintaining its alignment precision. A particular problem that we need to tackle is how to estimate the word similarity of OOV link units in a reliable way so as to reduce the number of alignment errors caused by partial matching.

Acknowledgements

The work reported in this paper is fully funded by the Research Grants Council of HKSAR, China through the grant #9040482 for a CERG project, with Jonathan J. Webster as the principal investigator and Chunyu Kit, Caesar S. Lun, Haihua Pan, King Kuai Sin and Vincent Wong as investigators. The authors wish to thank all team members for their contributions to this paper.
References

Ahrenberg, L., Merkel, M., Hein, A. S. and Tiedemann, J., 2000. Evaluation of word alignment systems. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 31 May - 2 June 2000, Volume III, pp. 1255-1261.

Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L., 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263-311.

Brown, R. D., 1997. Automated dictionary extraction for "knowledge-free" example-based translation. In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 97), Santa Fe, New Mexico, July, pp. 111-118.

Dempster, A. P., Laird, N. M. and Rubin, D. B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-22.

Ker, S. J. and Chang, J. S., 1997. A class-based approach to word alignment. Computational Linguistics, 23(2):314-343.

Kit, C. Y., Xu, Z. M. and Webster, J. J., 2003. Integrating ngram model and case-based learning for Chinese word segmentation. In Qing Ma and Fei Xia (eds.), Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, pp. 160-163.

Knight, K., 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4):607-615.

Melamed, I. D., 2000. Models of translational equivalence among words. Computational Linguistics, 26(2):221-249.

Och, F. J. and Ney, H., 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Somers, H. L., 2000. Example-based machine translation. In R. Dale, H. Moisl and H. Somers (eds.), Handbook of Natural Language Processing, Marcel Dekker, New York, pp. 611-627.

Tiedemann, J., 1999. Word alignment - step by step. In Proceedings of the 12th Nordic Conference on Computational Linguistics (NODALIDA), University of Trondheim, Norway, pp. 216-227.

Wu, D., 1996. A polynomial-time algorithm for statistical machine translation. In Proceedings of ACL'96, Santa Cruz, California, June, pp. 152-158.