Journal of Chinese Language and Computing 16 (4): 225-237
A New Dictionary-based Word Alignment Algorithm
Zhiming Xu*† , Jonathan J. Webster* and Chunyu Kit*
*
Department of Chinese, Translation and Linguistics, City University of Hong Kong,
Tat Chee Ave., Kowloon, Hong Kong
†
School of Computer Science and Technology, Harbin Institute of Technology, P. R. China
Box 321, Harbin Institute of Technology, Harbin, China 150001
xuzm@hit.edu.cn
ctjjw@cityu.edu.hk
ctckit@cityu.edu.hk
Abstract
This paper presents our recent work on dictionary-based word alignment for the
English-Chinese bilingual corpus of Hong Kong legal texts. It is intended to address a
number of critical issues involved in word alignment, including monolingual
pre-processing for identifying text items for alignment, proper formulation of similarity
measure for bilingual word pairs, and the novel treatment of many-to-many word
alignment. A dictionary-based word alignment algorithm (DictAlign) is proposed and
tested with the bilingual texts of Hong Kong laws. Experimental results show that this
approach achieves an impressive performance: 95.70% precision and 68.39% coverage
rate for English, and 95.73% precision and 64.87% coverage rate for Chinese.
Keywords: word alignment, word similarity
1. Introduction
The main task of word alignment is to identify translation equivalents between lexical
items in bilingual parallel texts (or bitexts). In addition to individual words, i.e.,
single-word units, the lexical items to be aligned also include multi-word units, such as
compound words, terms, idiomatic expressions and some text chunks of other types. The
lexical items for alignment are referred to as link units (Ahrenberg et al., 2000;
Tiedemann, 1999). Bilingual word alignment usually assumes a segment-aligned bitext as
input, where bitext segments may be mutual translations of clauses, sentences, and sentence
sequences (Tiedemann, 1999). In general, word alignment work involves the following
typical tasks: (1) acquisition of necessary resources, including bilingual dictionaries,
bilingual terminology and monolingual stop (or functional) word lists; (2) segmentation of
monolingual sentences into link units; and (3) alignment of bilingual link units in a given
bitext.
Word alignment systems may aim at different purposes, e.g., full-text word alignment vs.
bilingual lexicon extraction (Ahrenberg et al., 2000). Our recent work, as part of
an ongoing research project on EBMT for Hong Kong legal texts, is aimed at full-text word
alignment. The outcomes of this work are expected to facilitate text alignment at higher
linguistic levels, such as phrase/chunk alignment and sentence alignment, and translation
selection (Och and Ney, 2003; Knight, 1999; Wu, 1996).
Our word alignment work addresses both single-word and multi-word link units. Multi-word link units sometimes appear in the corpus in discontinuous form, with their parts separated by determiners, adjectives and the like, so it is not surprising that they are segmented into fragments. Moreover, because of imperfect link unit segmentation, mostly due to segmentation ambiguities, even continuous multi-word units may be split into fragments. Problems of this kind make the alignment of such link units more complicated than one-to-one word alignment. Once a pair of multi-word link units are both segmented into several fragments, the many-to-many alignment problem arises, although alignments of other types, including one-to-one, one-to-many and many-to-one, are more common in practice.
In this paper we will present our recent work on (1) the segmentation of English and
Chinese monolingual Hong Kong legal texts into link units and (2) the word alignment on
these bilingual link units. The remaining sections of the paper are organized as follows.
Section 2 introduces our probabilistic approach to link unit segmentation. Section 3
formulates an empirical similarity measure for identifying word pairs of mutual translation
with the aid of an available bilingual dictionary. Section 4 proposes a dictionary-based
word alignment algorithm, called DictAlign. It copes with all types of alignment of
discontinuous link units. Section 5 presents our experimental results on the BLIS corpus, a
complete collection of English-Chinese bilingual texts of Hong Kong laws, using this
algorithm. It achieves an alignment precision of about 95.7%, with coverage rates of 64.87% and 68.39%, respectively, for the Chinese and English texts in the input bilingual corpus. Section 6 concludes the paper and discusses possible directions for future work.
2. Identification of monolingual link units
2.1 Pre-processing
The proper segmentation of monolingual texts into link units is essential to word
alignment if we intend to extend word alignment to handle multi-word link units in
addition to single-word link units. Two typical preprocessing tasks are involved in this
stage of work for the identification of such monolingual link units, namely, lemmatization
and tokenization.
In our work, the process of lemmatization takes place with the preprocessing of English
texts, in order to convert variant word forms back to their lemmas, or base forms. Our
lemmatization module exploits a number of morphological rules to carry out the conversion
for regular nouns and verbs. Irregular nouns and verbs are handled individually with the aid
of a mapping table of irregular word forms and their lemmas.
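The rule-plus-table scheme described above can be sketched as follows. The suffix rules and the irregular-form table here are illustrative toy examples, not the module's actual resources:

```python
# Illustrative sketch of the rule-plus-table lemmatizer described above;
# the suffix rules and the irregular table are toy stand-ins, not the
# paper's actual morphological resources.
IRREGULAR = {"children": "child", "went": "go", "geese": "goose"}

SUFFIX_RULES = [        # (suffix, replacement), tried in order
    ("ies", "y"),       # studies -> study
    ("sses", "ss"),     # classes -> class
    ("s", ""),          # laws -> law
]

def lemmatize(word: str) -> str:
    w = word.lower()
    if w in IRREGULAR:                 # irregular nouns/verbs via lookup table
        return IRREGULAR[w]
    for suffix, repl in SUFFIX_RULES:  # regular forms via suffix rules
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return w[: len(w) - len(suffix)] + repl
    return w
```

A real rule set would be considerably larger and would handle stem changes (e.g. doubling or `e`-restoration) that this sketch ignores.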
Tokenization is a process to recognize text tokens for further processing. Here tokens can
be viewed as atomic text units that form the larger text units such as words for later
processing. They are not to be further decomposed into smaller units. In our work, a token
in Chinese is a Chinese character of fixed length (2 bytes) – it is worth noting that the
tokenization here for Chinese does not involve word segmentation; a token in English is a
single word, which is generally delimited by spaces and punctuation. A link unit in both languages is a sequence of one or more tokens. For example, in the pair of link units "香港" / "Hong Kong", "香港" consists of the two tokens "香" and "港", and "Hong Kong" of the two tokens "Hong" and "Kong".
2.2 Statistical approach to link unit identification
After the pre-processing phase, each sentence S in an input text T is converted into a token sequence S = t_1 t_2 … t_n = t_1^n. The task of statistical link unit segmentation is to find a link unit sequence over this token sequence, S = u_1 u_2 … u_m = u_1^m, that maximizes the probability p(u_1^m). We use a link unit-based unigram as the language model and define p(u_1^m) = ∏_{i=1}^{m} p(u_i), where p(u_i) = n(u_i) / Σ_{u∈T} n(u), and n(u_i) is the frequency of the link unit u_i in T.
The language model is trained on the unsegmented corpus T using the EM algorithm (Dempster et al., 1977). The training process is similar to the EM training presented in Kit et al. (2003) for probabilistic Chinese word segmentation. We train unigram models for English and Chinese using the BLIS corpus, and use them respectively to identify link units in each language.
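Once the unigram model is trained, decoding the most probable link unit sequence is a dynamic-programming search. The following is a minimal sketch of that decoding step (EM training itself is omitted); the probability table, the maximum unit length, and the back-off penalty for unknown tokens are hypothetical, and tokens are assumed to be Chinese characters joined without spaces:

```python
import math

def segment(tokens, probs, max_len=4):
    """Viterbi-style search for the link unit sequence u_1..u_m that
    maximizes prod_i p(u_i) under a unigram model (probs: unit -> prob)."""
    n = len(tokens)
    best = [0.0] + [float("-inf")] * n   # best log-prob of tokens[:i]
    back = [0] * (n + 1)                 # split point achieving best[i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            unit = "".join(tokens[j:i])  # Chinese tokens are characters
            if unit in probs:
                score = best[j] + math.log(probs[unit])
                if score > best[i]:
                    best[i], back[i] = score, j
        if best[i] == float("-inf"):     # back off: unknown single token
            best[i] = best[i - 1] + math.log(1e-8)
            back[i] = i - 1
    units, i = [], n
    while i > 0:                         # recover the best segmentation
        units.insert(0, "".join(tokens[back[i]:i]))
        i = back[i]
    return units
```

For example, with p("香港") greater than p("香")·p("港"), the two characters are kept together as one link unit.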
3. Bilingual word similarity
A machine-readable bilingual dictionary is the most critical lexical resource in our
dictionary-based approach to word alignment. In general, word alignment approaches can
be categorized into two types: statistics-based, as illustrated, for example, in Brown et al. (1993) and Melamed (2000), and dictionary-based, as we practice here. We expect a dictionary-based word alignment approach to achieve better performance than statistical word alignment methods, because the lexical resources it makes use of are more reliable than purely statistical information. The lexical resources that DictAlign uses include a Chinese-English dictionary of frequent words and a bilingual term list.
Given a bilingual sentence pair ( S , T ) , each as a sequence of monolingual link units, the
task of word alignment is to identify as many corresponding pairs of link units as possible
in the sentence pair. For each link unit s in source-language sentence S , we want to find
its counterpart in the target sentence T . This counterpart of s is called its in-context
translation (ICT), denoted as ict(s). On the other hand, the translation of s extracted from an available bilingual dictionary is called its dictionary translation (DT), denoted as dt(s) (Ker and Chang, 1997).
For any bilingual link unit pair (s, t) in (S, T), we need to estimate the likelihood that (s, t) is a pair of mutual translations. We can estimate this likelihood in terms of bilingual word similarity, denoted sim(s, t). Statistical measures, such as mutual information, Dice coefficients and other statistical metrics, are commonly used to estimate the association strength or distributional similarity of a word pair. Here we take a dictionary-based approach and estimate the string similarity between ict(s) and t for the purpose of determining whether (s, t) are mutual translations.
String similarity can be computed in terms of the longest common subsequence ratio
(LCSR), as illustrated in Tiedemann (1999) for the purpose of determining whether a word
pair is a cognate pair. Here, we also resort to LCSR for estimating the string similarity
between ict(s) and t. To preserve symmetry in word alignment, we define the bilingual word similarity of a link unit pair (s, t) as follows:
sim(s, t) = ( LCSR(s, dt(t)) + LCSR(dt(s), t) ) / 2        (1)

where LCSR(x, y) = length(LongestCommonSubsequence(x, y)) / max(length(x), length(y)). sim(s, t) gives a score indicating the likelihood that (s, t) is a pair of mutual translations. sim(s, t) is symmetric, i.e., sim(s, t) = sim(t, s). To avoid unexpected noise in word alignment, if the text strings in the pair (s, dt(t)) or (dt(s), t) do not include one another, we set sim(s, t) = 0.
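Equation (1), together with the containment filter, can be sketched as follows. The dictionary lookup `dt` here is a hypothetical callable returning a single dictionary translation (or an empty string for OOV units); the paper's actual dictionary may list several translations per entry:

```python
def lcs_len(x: str, y: str) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcsr(x: str, y: str) -> float:
    """Longest common subsequence ratio, as in Equation (1)."""
    return lcs_len(x, y) / max(len(x), len(y)) if x and y else 0.0

def sim(s: str, t: str, dt) -> float:
    """Equation (1): average of LCSR(s, dt(t)) and LCSR(dt(s), t).
    `dt` is a hypothetical lookup returning one dictionary translation."""
    a, b = dt(t), dt(s)      # dictionary translations (empty if OOV)
    # noise filter from the text: require containment in at least one pair
    ok = (a and (s in a or a in s)) or (b and (t in b or t in b))
    if not ok:
        return 0.0
    return (lcsr(s, a) + lcsr(b, t)) / 2
```

With identical strings on both sides of each pair, the score is 1.0; when neither pair exhibits containment, the score is clamped to 0.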
4. The DictAlign algorithm
[Figure 1. Direct link digraph of word alignment: source link units s1–s6 connected by direct links to target link units t1–t6]
Figure 1 shows a bilingual sentence pair (S, T), where S = {s1, s2, …, sm} and T = {t1, t2, …, tn}, each being a set of content-word link units obtained after the monolingual link unit segmentation phase and the filtering out of functional words. DictAlign, our dictionary-based word alignment algorithm, is designed to find as many proper pairs of link units as possible in the product set S × T.
4.1 Baseline algorithm of word alignment
In order to evaluate the performance of the DictAlign algorithm by comparison, we first implement a one-to-one word alignment algorithm as a baseline, using the Dice coefficient as the similarity measure. It is given as follows.
Given a pair of sentences (S, T), each as a sequence of content-word link units, S = {s1, s2, …, sm} and T = {t1, t2, …, tn}, the similarity sim(s, t) for each link unit pair (s, t) in S × T is first computed.
(1) Select the link unit pair (si, tj) = argmax_{(s,t) ∈ S×T} sim(s, t), if sim(si, tj) > 0. If none is available, exit.
(2) Align (si, tj), and let S = S − {si}, T = T − {tj}.
(3) Go to (1).
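The three steps above amount to a greedy best-first loop. A minimal sketch, with `sim` passed in as any pairwise similarity function (e.g. a Dice coefficient):

```python
def baseline_align(S, T, sim):
    """Greedy one-to-one alignment: repeatedly pick the highest-scoring
    pair with positive similarity, then remove both units from play."""
    S, T = list(S), list(T)
    aligned = []
    while S and T:
        # step (1): the pair maximizing sim over the remaining S x T
        score, s, t = max(
            ((sim(s, t), s, t) for s in S for t in T),
            key=lambda triple: triple[0],
        )
        if score <= 0:       # no positive-similarity pair left: exit
            break
        aligned.append((s, t))   # step (2): align and remove
        S.remove(s)
        T.remove(t)
    return aligned           # step (3) is the loop itself
```

Because each selected pair removes one unit from each side, the output is strictly one-to-one, which is exactly the limitation DictAlign addresses in the next subsection.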
4.2 DictAlign word alignment algorithm
The bilingual dictionary used in our research incorporates a bilingual glossary of Hong Kong laws, with many multi-word terms that sometimes appear in a discontinuous form in the
corpus. It is also quite common that even the continuous ones are segmented into several
fragments by the process of link unit segmentation, mostly due to segmentation
ambiguities. If a multi-word unit and its ICT are both in fragments, we have a
many-to-many mapping problem in our alignment work. If only one of the two counterparts
is in fragments, we have a one-to-many (or many-to-one) mapping problem, a special case
of many-to-many mapping. According to our observation, such special cases occur more often than full many-to-many alignment, which arises only occasionally. In order to decide whether two given sets (set1, set2) of such fragments should be aligned with each other, we need to calculate their similarity. Assuming set1 = {s_{m1}, s_{m2}, …, s_{mk}} and set2 = {t_{n1}, t_{n2}, …, t_{nl}}, each being a sequence of segmented fragments, we define their similarity as follows:

sim(set1, set2) = sim(s_{m1} ∘ s_{m2} ∘ … ∘ s_{mk}, t_{n1} ∘ t_{n2} ∘ … ∘ t_{nl})    (2)
where ∘ denotes string concatenation. The computation involved in (2) is identical to that for (1), except for the concatenation performed beforehand. It is worth noting, however, that this formula is effective in handling fragmental in-vocabulary (IV) link units, but not out-of-vocabulary (OOV) ones. Given a candidate pair of fragmental link units, if one is IV in its language and the other is OOV in the other language, they may only be partially aligned using (2).
Firstly, to align fragmental link units, we need to construct the map set map(u) for each link unit in S or T. For each si ∈ S, map(si) = {t ∈ T | ∀s′ ∈ S, sim(si, t) ≥ sim(s′, t)}; for each tj ∈ T, map(tj) = {s ∈ S | ∀t′ ∈ T, sim(s, tj) ≥ sim(s, t′)}. That is, a link unit t in map(s) takes s as its most probable ict(t). This relation is denoted by a direct link t → s, as in Figure 1, and the length of t → s is defined as sim(t, s). Once the map set of each link unit in S or T is constructed, we select the longest direct link si → tj among all the direct links in S × T, with sim(si, tj) > 0. If no direct link is available, we are done; otherwise, for the selected direct link si → tj, we seek the pair of elements (set1, set2) = argmax_{(x, y) ∈ 2^{map(si)} × 2^{map(tj)}} sim(x, y), where 2^{map(u)} denotes the power set of map(u). Then we align (set1, set2), and let S = S − set1, T = T − set2. Next we re-construct the map set of each unaligned link unit in S or T, and iterate the above alignment process until no more alignment is available.
Building on the above process to tackle the many-to-many word alignment problem, the DictAlign algorithm is formulated as follows:
Given a pair of sentences (S, T), each as a sequence of content-word link units, S = {s1, s2, …, sm} and T = {t1, t2, …, tn}, the similarity sim(s, t) for each link unit pair (s, t) in S × T is first computed.
(1) For each link unit si ∈ S and tj ∈ T, construct their map sets.
(2) Select the link unit pair (si, tj) = argmax_{(s,t) ∈ S×T} sim(s, t), if sim(si, tj) > 0. If none is available, exit.
(3) Select the element pair (set1, set2) = argmax_{(x,y) ∈ 2^{map(si)} × 2^{map(tj)}} sim(x, y).
(4) Align (set1, set2), and let S = S − set1, T = T − set2.
(5) Go to (1).
We now give an example to illustrate how DictAlign works step by step. Here is a pair of bilingual sentences:
English: Get the children up.
Chinese: 讓孩子們起床。
During the monolingual link unit identification phase, the English sentence may be segmented into several fragments, "get / the / children / up /.", where "get up" appears in discontinuous form. How can we recognize "get" and "up" as a single multi-word unit and align them with "起床"? Compared with the statistical approach, dictionary knowledge is more reliable. In this paper, we use dictionary knowledge to determine whether a set of fragments constitutes a multi-word unit, so that we can filter out impossible fragment combinations.
The pair of sentences is segmented into sequences of fragments:
English: Get / the / children / up /.
Chinese: 讓 / 孩子們 / 起床 /。
These fragments generate a pair of link unit sets:
S = {"get", "the", "children", "up"}
T = {"讓", "孩子們", "起床"}
DictAlign aligns them through the following steps.
1. Using Equation (1), we compute the word similarity between every pair of link units; these similarity values form a similarity matrix over S × T.
2. Select the pair of link units with the maximum word similarity. Here we assume sim("起床", "up") is the maximum; since DictAlign is a best-first algorithm, it selects the pair "起床" and "up" to align first.
3. Construct the map set for each element of S or T using the following definitions: for ci ∈ S, map(ci) = {e ∈ T | ∀c ∈ S, sim(ci, e) ≥ sim(c, e)}; for ej ∈ T, map(ej) = {c ∈ S | ∀e ∈ T, sim(c, ej) ≥ sim(c, e)}. Each link unit t in map(s) takes s as the most probable ict(t). For example,
Map("起床") = {"get", "up"}
Map("up") = {"起床"}
4. Construct the power set of each map(u), denoted 2^{map(u)}. It constitutes the space of possible fragment combinations; we select a pair of fragment combinations from this space and align them.
The power set of Map("起床"): 2^{map("起床")} = {{"get"}, {"up"}, {"get", "up"}}.
The power set of Map("up"): 2^{map("up")} = {{"起床"}}.
We then select the link unit pair (u, v) from 2^{map("起床")} × 2^{map("up")} that maximizes sim(u, v); here, (u, v) = ({"get", "up"}, {"起床"}).
5. Finally, we align ({"get", "up"}, {"起床"}) and move the pair into the aligned result, then iterate the above process.
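The selection in steps 3-5 hinges on searching the product of the two power sets. A minimal sketch of that step, where `set_sim` is assumed to concatenate the fragments in each set and apply Equation (1):

```python
from itertools import combinations

def power_set(xs):
    """All non-empty subsets of xs (the empty set cannot be aligned)."""
    xs = list(xs)
    return [set(c) for r in range(1, len(xs) + 1)
            for c in combinations(xs, r)]

def best_fragment_pair(map_s, map_t, set_sim):
    """DictAlign step (3): choose (set1, set2) from the power sets of the
    two map sets that maximizes the concatenation-based similarity, Eq. (2).
    `set_sim` is assumed to concatenate fragments and apply Equation (1)."""
    return max(
        ((x, y) for x in power_set(map_s) for y in power_set(map_t)),
        key=lambda pair: set_sim(pair[0], pair[1]),
    )
```

On the worked example, with Map("起床") = {"get", "up"} and Map("up") = {"起床"}, the candidate space has 3 × 1 pairs, and the dictionary entry for "get up" makes the full combination win.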
5. Experimental results
This section presents our experimental results using the DictAlign algorithm on
English-Chinese bilingual texts of Hong Kong laws. The bilingual dictionary we use includes 134,497 bilingual word pairs and 25,214 bilingual legal term pairs. A stop list is used to filter out function (or non-content) words in each language. The training corpus consists of 238,271 aligned bilingual sub-paragraph pairs from the BLIS corpus, many of which are sub-paragraph legal text items. In total, 38MB of English text and 23MB of Chinese text are used, respectively, to train the unigram models for link unit segmentation in the two languages. A sub-corpus of the first 205 sub-paragraph pairs is selected as test data to evaluate the performance of the DictAlign algorithm and the baseline word alignment method. Their performance is presented in Table 1 for comparison. We can see that DictAlign achieves significantly higher precision than the baseline, although its coverage rate is lower. The majority of alignment errors made by DictAlign result from partial matching; in contrast, the majority of alignment errors by the baseline are mismatches. The precision and coverage rate of word alignment are defined as follows.
precision = (number of correctly aligned content link units in test corpus) / (number of aligned content link units in test corpus)    (3)

coverage rate = (number of aligned content link units in test corpus) / (number of content link units in test corpus)    (4)
Table 1: The alignment performance: DictAlign vs. the baseline

                              DictAlign            Baseline
                           English  Chinese    English  Chinese
Test corpus: words           4970     5506     (4970)   (5506)
Test corpus: link units      1838     1950     (1838)   (1950)
Aligned link units           1257     1265      1747     1747
Partial-matching               40       40        36       36
Mismatching                    14       14       262      262
Coverage rate               68.39%   64.87%    95.05%   89.59%
Precision                   95.70%   95.73%    83.00%   83.00%
The example given below illustrates how link unit segmentation and word alignment are carried out step by step.
(1) Input: a pair of un-segmented sub-paragraphs
<s id=1> 本條例旨在綜合和修訂有關法例的釋疑﹑適用範圍﹑釋義的法律﹐ 設立
關於這些事宜的一般條文﹐對法例和公共文件中的詞語和詞句下定義﹐訂立關於
公職人員﹑政府和公共機構合約﹑民事和刑事程序的一般條文﹐以及為由這些事
附帶引起的或與這些事相關的目的和事宜訂立一般條文。
<s id=1> To consolidate and amend the law relating to the construction, application and
interpretation of laws, to make general provisions with regard thereto, to define terms and
expressions used in laws and public documents, to make general provision with regard to
public officers, public contracts and civil and criminal proceedings and for purposes and
for matters incidental thereto or connected therewith.
(2) Link unit segmentation result
<s id=1> |本條例|旨在|綜合|和|修訂|有關|法例|的|釋疑|﹑|適用範圍|﹑|釋義|
的|法律|﹐ |設立|關於|這些|事宜|的|一般條文|﹐|對|法例|和|公共文件|中的|
詞語|和|詞句|下定義|﹐|訂立|關於|公職人員|﹑|政府和公共機構合約|﹑|民事|
和|刑事程序|的|一般條文|﹐|以及|為|由|這些|事|附帶引起|的|或|與|這些|事|
相關的|目的|和|事宜|訂立一般條文|。
<s id=1> |to|consolidate|and|amend|the|law|relating to|the|construction|,|application|and|interpretation|of|law|,|to|make general provision|with|regard|thereto|,|to|define|terms|and|expressions|used|in|law|and|public|documents|,|to|make general provision|with regard to|public|officers|,|public contract|and|civil|and|criminal proceedings|and|for|purposes|and|for|matter|incidental|thereto|or|connected|therewith|.
(3) Word alignment output from DictAlign

Line  Chinese link units (position)   English link units (position)   Type (m:n)
 1    [政府和公共機構合約(38)]          [public contract(41)]            1:1
 2    [訂立一般條文(61)]               [make general provision(36)]     1:1
 3    [訂立(17)] [一般條文(22)]        [make general provision(18)]     2:1
 4    [附帶引起(51)]                   [incidental(52)]                 1:1
 5    [適用範圍(11)]                   [application(11)]                1:1
 6    [下定義(32)]                     [define(24)]                     1:1
 7    [民事(40)]                       [civil(43)]                      1:1
 8    [刑事程序(42)]                   [criminal proceedings(45)]       1:1
 9    [釋疑(9)]                        [construction(9)]                1:1
10    [公共文件(27)]                   [public(32)]                     1:1
11    [釋義(13)]                       [interpretation(13)]             1:1
12    [詞句(31)]                       [expressions(27)]                1:1
13    [事宜(20)]                       [matter(51)]                     1:1
14    [有關(6)]                        [relating to(7)]                 1:1
15    [公職人員(36)]                   [public(38)]                     1:1
16    [綜合(3)]                        [consolidate(2)]                 1:1
17    [法律(15)]                       [law(15)]                        1:1
18    [修訂(5)]                        [amend(4)]                       1:1
19    [法例(7)]                        [law(6)]                         1:1
20    [法例(25)]                       [law(30)]                        1:1
The above also shows that the majority of alignments are of the one-to-one (1:1) type, and that the aligned bilingual link units in the two languages are, in general, not far from each other. Line 3 presents a two-to-one (2:1) alignment, in which two discontinuous Chinese link units "訂立 (17)" and "一般條文 (22)" are aligned with the single English link unit "make general provision (18)". Line 10 shows an erroneous alignment, where "公共文件 (27)" is only partially aligned with "public (32)". The original bilingual text shows that "公共文件 (27)" should be aligned with both "public (32)" and "documents (33)". Errors of this type mostly result from inadequate lexical resources. For example, "public documents" is an OOV link unit that our dictionary-based approach is unable to cope with.
Line 15 is another error caused, again, by inadequate lexical resources. At the price of such errors, however, this dictionary-based approach is particularly powerful in handling IV items, achieving a promising alignment precision.
6. Conclusion and future work
In this paper we have presented a dictionary-based approach to word alignment, including the monolingual link unit segmentation method, the formulation of an empirical measure for the similarity of candidate word pairs, the DictAlign algorithm for word alignment, and experimental results on an English-Chinese bilingual corpus of Hong Kong laws. The research
work reported here is part of an ongoing EBMT project whose current phase is focusing on
the automatic acquisition of high quality translation equivalents (or examples) through text
alignment at various levels of granularity, including word, phrase (or chunk) and clause.
A high alignment precision is demanded for the purpose of ensuring the quality of the
examples so acquired. Our experiments have shown that the word similarity measure based
on the longest common subsequence ratio results in a remarkable alignment precision.
Besides, the DictAlign algorithm is language-independent and can be applied to other
language pairs. Our future work will focus on integrating the DictAlign algorithm with a
statistical word alignment method for performance improvement, to improve its coverage
rate while maintaining its alignment precision. A particular problem that we need to tackle
is how to estimate the word similarity for OOV link units in a reliable way so as to reduce
the number of alignment errors caused by partial matching.
Acknowledgements
The work reported in this paper is fully funded by the Research Grants Council of HKSAR,
China through the grant #9040482 for a CERG project, with Jonathan J. Webster as the
principal investigator and Chunyu Kit, Caesar S. Lun, Haihua Pan, King Kuai Sin and
Vincent Wong as investigators. The authors wish to thank all team members for their contributions to this paper.
References
Ahrenberg, L., Merkel, M., Hein, A. S., Tiedemann, J., 2000, Evaluation of word alignment
systems, In Proceedings of the Second International Conference on Linguistic
Resources and Evaluation (LREC 2000), Athens, Greece, 31 May - 2 June, 2000,
Volume III: pp. 1255-1261.
Brown, R. D., 1997, Automated dictionary extraction for “knowledge-free” example-based
translation. In Seventh International Conference on Theoretical and Methodological
Issues in Machine Translation (TMI 97), Santa Fe, New Mexico, July, pp. 111-118.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L., 1993, The
mathematics of statistical machine translation: parameter estimation, Computational
Linguistics, 19(2):263-311.
Dempster, A. P., Laird, N. M. and Rubin, D. B., 1977, Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B,
39(1):1-22.
Ker, S. J. and Chang, J. S., 1997, A class-based approach to word alignment,
Computational Linguistics, 23(2):314-343.
Kit, C. Y., Xu, Z. M. and Webster, J. J., 2003, Integrating ngram model and case-based learning for Chinese word segmentation. In Qing Ma & Fei Xia (eds.), SIGHAN-2, Sapporo, pp. 160-163.
Knight, K., 1999, Decoding complexity in word-replacement translation models,
Computational Linguistics, 25(4):607-615.
Melamed, I. D., 2000, Models of translational equivalence among words, Computational
Linguistics, 26(2):221-249.
Och, F. J. and Ney, H., 2003, A systematic comparison of various statistical alignment
models, Computational Linguistics, 29(1):19-51.
Somers, H. L., 2000, Example-based machine translation, In R. Dale, H. Moisl, and H.
Somers, editors, Handbook of Natural Language Processing, Marcel Dekker, New York,
pp. 611-627.
Tiedemann, J., 1999, Word alignment - step by step, In Proceedings of the 12th Nordic Conference on Computational Linguistics (NODALIDA), University of Trondheim, Norway, pp. 216-227.
Wu, D., 1996, A polynomial-time algorithm for statistical machine translation, In ACL’96,
Santa Cruz, California, June, pp. 152-158.