
Bigger Context and Better Understanding
-- Expectation on Future MT Technology
Zhendong Dong
E-mail: dzddong@public.bta.net.cn
Abstract
This paper discusses new technology for next-generation machine translation. After a
brief review of the problems in the state of the art of MT, the discussion focuses on two aspects:
a common-sense knowledge database and multi-sentential processing. The construction and
application of new knowledge resources are emphasized, and a concise presentation of HowNet,
a Chinese-English knowledge system, is given. The paper gives a practical description of how
understanding-based MT will be implemented.
Key words
machine translation, MT, discourse, text understanding, parsing,
sense disambiguation, HowNet, knowledge dictionary
1. Motivation
Nowadays, searching the Internet for the topic of machine translation, we can find quite a lot of
interesting material: free MT services, commercial products on sale, short-term courses on MT, calls
for papers for the MT Summit or regional conferences, continental associations in the Americas, Asia
and Europe, and publications including professional books, journals and academic papers. However,
we also see some rather different pictures, among which the most impressive to me is a new web site
named madtrans@usa.net. The name of the site suggests, as the participants intend, that MT should
stand for Mad Translation rather than Machine Translation. The participants of the site claim, "The
purpose of this site is to re-establish the truth about the real (im)possibilities of Machine
Translation systems (also called automatic translators). This is a quite complete synthesis with
automatic translation examples, analyses from specialists and even expectations about the future of
this technology." In the article they present many funny examples produced by MT systems, quote at
length the reasons why computers cannot translate as human translators do, and they urge,
"If you choose to purchase an automatic translator in spite of our advice and the examples we have
shown you, be prepared to waste your money".
As a researcher and developer who has unfortunately been stuck in MT for nearly 30 years, I am
really reluctant to agree with their views in general, and with the "expectations about the future of
this technology" in particular. However, this paper is not intended to argue, but to discuss the future
of MT technology. The discussion will cover what an innovation in future MT technology should be.
The innovation has three aspects: (1) single-sentence processing will be replaced by
sentence-group (discourse) processing; (2) the depth of source language parsing (or, more
accurately, analysis) will change; (3) a novel device of target language generation will be adopted.
2. Real Problems
I do not agree with some conclusions of the madtrans@usa.net participants, but I admit that the
problems in MT they enumerated do exist, though they may not be the key problems. What, then,
are the real key problems in MT?
Firstly, let's imagine. A human translator handles one and only one sentence at a time when he is
doing translation. Besides, when he translates a sentence, he has no idea at all about the previous
one he has just dealt with, nor about the next one he is going to handle. Have you ever come across
such a human translator? If there were one like this, could he do a good translation job? Would you
want to hire a translator like this? The answer would definitely be negative. In reality no human
translator translates on the basis of single-sentence context processing. But unfortunately this is just
the case with MT, and this is one of the key problems which make it "mad".
Secondly, let's take the following sentence as an example. As far as I know, most English-Chinese
MT systems available in China are able to give a correct translation of the English sentence "Mr.
Nixon was sick and sent to hospital this morning". However, I am sure that none of the systems
can give a correct answer even to a simple question like "Who was ill?" or "Where was Mr. Nixon
sent?". Can you imagine the same situation happening with a human translator? It is really
unbelievable that translation can be done without understanding.
To summarize, the present technology of MT has two principal problems: one is single-sentence
context processing, the other is the lack of a basis of understanding. In comparison with these two
problems, the controversy about "mainstream linguistics", "universal grammar", "statistical
modeling", "derivation trees", "parsing algorithms", or "interlingual" versus "transfer" etc. is merely
a minor one. We take these two problems as the typical "traditional patterns which will clearly have
to be broken if any real progress is to be made" [Kay96].
To expand the scope of processing from single sentences to multiple sentences or sentence-groups,
to achieve understanding of the source language, and to achieve real re-composition in target
language generation so as to get rid of the syntactic shadow of the source language: these two
techniques, we believe, will be the ice-breakers for a new era of MT.
3. Sentence-group Processing and Understanding
3.1 Sentence-group
By sentence-group, we mean more than one sentence in series in a text. A sentence-group may be a
paragraph or an even bigger context, or just two or three sentences within a paragraph. The number
of sentences to be processed is decided by the system designer, and we consider it flexible or
adjustable to cater for the needs of different kinds of texts.
Single-sentence versus sentence-group processing is by no means a question of quantity, or an issue
of a computer's memory space; it is an issue of overall innovation for next-generation MT technology.
Within the traditional patterns, the parser of an MT system makes a syntactic tree for each processed
sentence, one tree per sentence. The trees of adjacent sentences are totally unrelated. No relations
can be formed between those trees even if the parser keeps the parsing results for all sentences.
What relations could we build between syntactic trees? Can the parser specify any relation between
the subject (NP) of the first sentence and the predicate (VP) of the next sentence?
However, in reality there are meaningful links among the sentences in a sentence-group or a
paragraph, unless the text is awkwardly or incoherently written. These links of meaning are critical
for human translators to gain a good understanding of the text. Without them, even a human
translator will sometimes be at a loss when tackling an isolated (deliberately constructed) sentence,
like "he met John near the bank". Only with a bigger context can we achieve more understanding.
3.2 Links of meaning
It is not easy to expand the context from one sentence to more and thereby obtain better
understanding. The key factor is to build up the links of meaning between sentences. If we cannot
achieve this, a bigger context will be useless. What kinds of links are we to build, then? The links
of meaning include three kinds of relation:
(1) relations between concepts;
(2) relations between events (usually denoted by verbs, especially main verbs);
(3) the shifting of the roles of the events.
Now let's take the following sentences as an example:
In the past few years the Tanakas bought one or two toys for their children
every time they travelled abroad. It is really a pity that now they lost nearly all
of them.
After the "parsing" of these sentences by a future MT system, the three kinds of relation will be
obtained as follows:
(1) relations between concepts (main nodes only):
bought(buy|) --- main-event1
years(time|) --- duration
Tanakas(human|) --- agent
toys(tool|) --- possession
children(human|) --- beneficiary
every time(time|) --- time
travelled(tour|) --- main-event1.1
they1(human|) --- agent
abroad(location|) --- location
lost(lose|) --- main-event2
It is really a pity that(comment|) --- comment
they2(human|) --- relevant
them3(?) --- possession
(2) relations between events:
i. have| = result of buy|
ii. pay| = precondition of buy|
iii. have| = precondition of lose|
(3) role shifting rules (RSR)
a. beneficiary of buy| --> relevant of have|
b. agent of buy| --> relevant of have| {if no other beneficiary}
c. possession of buy| --> possession of have|
d. relevant of have| --> relevant of lose|
e. possession of have| --> possession of lose|
they2 = relevant of lose|
according to RSR-d. --> relevant of have|
according to RSR-a. --> beneficiary of buy| (children)
them3 = possession of lose|
according to RSR-e. --> possession of have|
according to RSR-c. --> possession of buy| (toys)
Note: 1. Rules a, b, c, d and e listed above are the rules which control role shifting.
2. A word followed by the symbol "|" represents a concept.
The role-shifting process gives plausible answers to the questions of "who" and "what". This is a
reliable approach to solving the anaphora problem, and it will also be a reliable approach to
solving ellipsis. Incidentally, if the above example were in idiomatic Chinese, the problem would
turn into one of ellipsis, that is, both the subject and the object of "lost" would be omitted.
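The role-shifting chains above are mechanical enough to sketch in code. The following is a minimal illustration, not HowNet's actual implementation; the rule table, function and data are all hypothetical names chosen for this example. It shows how rules RSR-a to RSR-e can be chained to trace "they2" and "them3" back to the filled roles of buy|:

```python
# Hypothetical sketch of role-shifting rule application (not HowNet's real code).
# Each (role, event) pair maps to the (role, event) pairs it may shift from.
RSR = {
    ("relevant", "have|"): [("beneficiary", "buy|"),    # rule a
                            ("agent", "buy|")],         # rule b (if no other beneficiary)
    ("possession", "have|"): [("possession", "buy|")],  # rule c
    ("relevant", "lose|"): [("relevant", "have|")],     # rule d
    ("possession", "lose|"): [("possession", "have|")], # rule e
}

def trace_back(role, event, fillers):
    """Follow role-shifting rules until a role filled by the parse is found."""
    for prev_role, prev_event in RSR.get((role, event), []):
        filler = fillers.get((prev_role, prev_event))
        if filler is not None:
            return filler
        # Keep shifting through intermediate events such as have|.
        filler = trace_back(prev_role, prev_event, fillers)
        if filler is not None:
            return filler
    return None

# Roles actually filled by the parse of the first sentence.
parsed = {
    ("agent", "buy|"): "Tanakas",
    ("possession", "buy|"): "toys",
    ("beneficiary", "buy|"): "children",
}

print(trace_back("relevant", "lose|", parsed))    # they2 -> children
print(trace_back("possession", "lose|", parsed))  # them3 -> toys
```

The chain for "they2" runs lose| -> have| (rule d) -> buy| (rule a), landing on "children", exactly the resolution given above.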
4. New Technology
Our discussion of new technology for next-generation MT will be focused on two points: a general
knowledge database and multi-sentential processing.
The discussion in 3.2 clearly shows that we need some new resources. The traditional MT
dictionary (with syntactic information and some fragmentary semantic information, i.e. confined to
language knowledge only, or with only very little world knowledge) will not be able to meet the
needs of the operations described in 3.2. New technology needs new resources, among which the
critical one is a general knowledge database.
We would like to take HowNet as an example to give a comprehensive picture of a general
knowledge database and its application to machine translation. HowNet was released on the site of
www.how-net.com in March 1999.
4.1 HowNet – A General Knowledge Database
HowNet is a general knowledge database which describes relations between concepts and relations
between the attributes of concepts. The concepts HowNet defines are denoted by words and phrases
in Chinese and English. We regard HowNet as a Chinese-English bilingual common-sense
knowledge system.
4.1.1 Sub-databases of HowNet
HowNet is mainly composed of 9 sub-databases. They are as follows:
(1) Chinese-English Bilingual Knowledge Dictionary (CEKD)
(2) Main Features of Concepts (1) (MFC-1)
(3) Main Features of Concepts (2) (MFC-2)
(4) Secondary Features of Concepts (1) (SFC-1)
(5) Secondary Features of Concepts (2) (SFC-2)
(6) Secondary Features of Concepts (3) (SFC-3)
(7) Event Role and Features (ERF)
(8) List of Antonymous Relations (LAR)
(9) List of Converse Relations (LCR)
Apart from these sub-databases, HowNet contains a few descriptive files, such as the pointers and
their usage and the parts of speech, and a maintenance toolkit. HowNet also has a Chinese GB
Knowledge Dictionary, a Chinese Big5 Knowledge Dictionary and an English Knowledge Dictionary
(though not as comprehensive as the Chinese one), which are extracted from the bilingual
Knowledge Dictionary.
The knowledge dictionary is the core component of HowNet. The size of HowNet depends mainly
on the size of its Chinese-English Bilingual Knowledge Dictionary which is measured by the
number of word forms and the number of concepts (meanings). The size of HowNet 1.0a is as
follows:
Word forms       Chinese   English
  Total           50220     55422
  N-category      26037     28876
  V-category      16657     16706
  A-category      09768     10716

Concepts         Chinese   English
  Total           62174     72994
  N-category      29787     36770
  V-category      20468     21203
  A-category      11173     14339
Note: the categories N, V, A do not exactly correspond to parts of speech Noun, Verb
and Adj.
Main Features of Concepts (1) (MFC-1) is composed of 800 main features of events, organized in a
hierarchy. There is a role framework for each main feature, in which the absolutely necessary roles
of that main feature are specified. Main Features of Concepts (2) (MFC-2) is composed of 140 main
features of things (including physical or mental substances, facts, attributes, space and time),
organized in a hierarchy. In HowNet the universality of the concepts of the same category is
indicated by each main feature. Concepts are well defined by these 1400 main features and the
secondary features, with the help of pointers and the HowNet Concept Definition Markup Language.
4.1.2 Types of Concept Relations
HowNet is powerful in describing relations between concepts. It describes not only relations within
the same category, but also cross-category relations. The types of relations described by HowNet are
mainly as follows:
a. superordinate-subordinate (by means of MFC-1 and MFC-2)
b. synonym (by means of DEF and bilingual equivalents)
c. antonym (by means of LAR)
d. converse (by means of LCR)
e. part-whole (coded with pointer %, e.g. “heart”, “CPU”, etc)
f. attribute-host (coded with pointer &, e.g. “color”, “speed”, etc)
g. material-product (coded with pointer ?, e.g. “cloth”, “flour”, etc)
h. agent-event (coded with pointer *, e.g. “doctor”, “employer”, etc)
(may also be “experiencer” or “relevant”, depending on the type of event)
i. patient-event (coded with pointer $, e.g. “patient”, “employee”, etc)
(may also be “content” or “possession”, etc. depending on the type of event)
j. instrument-event (coded with pointer *, e.g. “watch”, “computer”, etc)
k. location-event (coded with pointer @, e.g. “bank”, “hospital”, “shop”, etc)
l. time-event (coded with pointer @, e.g. “holiday”, “pregnancy”, etc)
m. value-attribute (coded without pointer, e.g. “blue”, “slow”, etc)
n. entity-value (coded without pointer, e.g. “dwarf”, “fool”, etc)
o. event-role (coded with role-name, e.g. “wail”, “shopping”, “bulge”, etc)
p. concepts related (coded with pointer #, e.g. “cereal”, “coalfield”, etc)
4.1.3 Concept Description in the Knowledge Dictionary (KD)
Every concept, as an entry in HowNet, has the following description items for each language:
W_X= word or phrase
G_X= POS
E_X= examples
DEF= concept definition
Let's look at some examples and see how the concepts are defined in KD of HowNet.
NO.=040995
W_C=教
G_C=V
E_C=
W_E=teach
G_E=V
E_E=
DEF=teach|教,education|教育

NO.=016525
W_C=大学
G_C=N
E_C=
W_E=university
G_E=N
E_E=
DEF=InstitutePlace|场所,@teach|教,@study|学,education|教育

NO.=041046
W_C=教授
G_C=N
E_C=
W_E=professor
G_E=N
E_E=
DEF=human|人,*teach|教,education|教育

NO.=089130
W_C=学生
G_C=N
E_C=
W_E=student
G_E=N
E_E=
DEF=human|人,*study|学,education|教育

NO.=052920
W_C=论文
G_C=N
E_C=
W_E=paper
G_E=N
E_E=
DEF=text|语文,#research|研究

NO.=006112
W_C=博士
G_C=N
E_C=
W_E=doctor
G_E=N
E_E=
DEF=human|人,*research|研究,*study|学,education|教育

NO.=092249
W_C=医
G_C=V
E_C=
W_E=treat
G_E=V
E_E=
DEF=cure|医治

NO.=092291
W_C=医院
G_C=N
E_C=
W_E=hospital
G_E=N
E_E=
DEF=InstitutePlace|场所,@cure|医治,#disease|疾病,medical|医

NO.=092273
W_C=医生
G_C=N
E_C=
W_E=doctor
G_E=N
E_E=
DEF=human|人,*cure|医治,medical|医

NO.=034930
W_C=患者
G_C=N
E_C=
W_E=patient
G_E=N
E_E=
DEF=human|人,*SufferFrom|罹患,$cure|医治,#medical|医
Note: “human|人”, “InstitutePlace|场所”, “text|语文”, “disease|疾病”, etc. are N-category main
features; “SufferFrom|罹患”, “cure|医治”, “research|研究”, etc. are V-category main features;
“education|教育” and “medical|医” are secondary features. “*”, “@”, “$”, etc. are pointers. Thus the
DEF of the word “hospital” means that a hospital is a place where (expressed by @) people treat
diseases, and that it belongs to the medical domain.
4.1.4 Event Role Framework
As we mentioned before, each of the 800 main features of events has an attached role framework,
in which the absolutely necessary roles for the events sharing that main feature are specified. By
absolutely necessary roles, we mean those which will definitely participate once an event happens,
no matter whether they are expressed in real communication or not. A few lines of MFC-1 are cited
below.
V1.02 possession|领属关系
    BelongTo|属于 {relevant,possessor}
    own|有 {relevant,possession}
    OwnNot|无 {relevant,possession}
    lose|失去 {relevant,possession}
V2.02 AlterPossession|变领属 {agent,possession}
    take|取 {agent,possession,source}
    earn|赚 {agent,possession,source}
    buy|买 {agent,possession,source,cost,~beneficiary}[commercial|商]
    collect|收 {agent,possession,source}
Whenever the main feature own|有, i.e. the category for “possess”, “have”, etc., occurs, “who”
(possesses) and “what” (he possesses) will definitely exist. As for the event buy|买, “who” (bought),
“what” (he bought), “where” (he bought it from), “how much” (he paid for it), and “for whom” (he
bought it, or just for himself) will all definitely exist.
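The framework lines cited above amount to a table from each main feature to its obligatory role set, which a system can consult directly. The following is a minimal sketch under that assumption; the dictionary and the helper function are hypothetical encodings for this illustration:

```python
# Hypothetical encoding of the MFC-1 role frameworks cited above.
FRAMEWORKS = {
    "own|有":   {"relevant", "possession"},
    "lose|失去": {"relevant", "possession"},
    "buy|买":   {"agent", "possession", "source", "cost", "beneficiary"},
    "take|取":  {"agent", "possession", "source"},
}

def missing_roles(main_feature, filled):
    """Roles the framework requires that the text has not yet supplied."""
    return FRAMEWORKS[main_feature] - set(filled)

# For "the Tanakas bought one or two toys for their children",
# the parse fills agent, possession and beneficiary:
print(missing_roles("buy|买", {"agent", "possession", "beneficiary"}))
# the unfilled roles are source and cost
```

Any role reported missing here is exactly the kind of slot the analyzer described in 4.2.1 will try to fill from neighbouring sentences.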
4.2 Application of HowNet
4.2.1 For Better Understanding
Let’s take the example sentences in 3.2 to illustrate the process of new source language parsing.
In the past few years the Tanakas bought one or two toys for their children
every time they travelled abroad. It is really a pity that now they lost nearly all
of them.
Different from the present technique, the new analysis technique will, apart from traditional
syntactic parsing, perform slot-filling whenever it comes across a verb in the sentence. In the above
sentences, the verbs the analyzer will find are “buy”, “travel” and “lose”. It will consult HowNet’s
MFC-1, and may then construct a table as follows.
“buy”                     “travel”                 “lose”
agent        Tanakas      agent        they        relevant     they
possession   toy          location     abroad      possession   them
source       ?            direction    ?
cost         ?            LocationIni  ?
beneficiary  children     LocationFin  ?
time         travel
duration     year
The slot-filling process will be carried out repeatedly over a selected length of text (a
sentence-group), since a role unknown in the sentence currently being processed might be detected
in the next sentence or in a previous one.
After the slot-filling process, the rules of “relations between events” and “role shifting” will be
applied. It is clear that they can help solve the problems of anaphora and ellipsis in a very
reasonable way. In the above example, with the help of the role-shifting rules, it is easy for the
analyzer to give a correct answer to the question “who lost what”.
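The repeated slot-filling pass described above can be sketched as a loop over the whole sentence-group: each verb opens a frame from its role framework, and any slot still unknown is retried against the neighbouring sentences. Everything below is an illustrative assumption; the per-sentence parse is stubbed out as ready-made role dictionaries:

```python
# Sketch of sentence-group slot-filling (illustrative; parsing is stubbed out).
FRAMEWORKS = {
    "buy":    ["agent", "possession", "source", "cost", "beneficiary"],
    "travel": ["agent", "location", "direction"],
    "lose":   ["relevant", "possession"],
}

# Pretend output of a per-sentence parser: verb -> roles it could fill locally.
sentence_group = [
    {"buy": {"agent": "Tanakas", "possession": "toys", "beneficiary": "children"},
     "travel": {"agent": "they", "location": "abroad"}},
    {"lose": {"relevant": "they", "possession": "them"}},
]

def fill_frames(group):
    """Open a frame per verb; fill from its own sentence, then retry the group."""
    frames = {}
    for sentence in group:
        for verb, local in sentence.items():
            frame = {role: local.get(role, "?") for role in FRAMEWORKS[verb]}
            # Second pass: try to fill '?' slots from other sentences in the group.
            for other in group:
                for roles in other.values():
                    for role, filler in roles.items():
                        if frame.get(role) == "?":
                            frame[role] = filler
            frames[verb] = frame
    return frames

frames = fill_frames(sentence_group)
# frames["buy"]["source"] and frames["buy"]["cost"] remain "?" because no
# sentence in the group supplies them; the role-shifting rules then take over
# to resolve "they"/"them" in the "lose" frame back to the "buy" frame.
```

The "?" markers are precisely the unknown roles the text says may only be detected in a neighbouring sentence, or may stay unknown.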
4.2.2 For Sense Disambiguation
In the traditional patterns, sense disambiguation in MT is mainly done by rules. As a rule, one sense
has to be determined by one or more rules. Since the processing is confined to the scope of a single
sentence, the traditional techniques are rather weak.
An innovation can also be made in sense disambiguation, on the condition that new knowledge
resources like HowNet are used. The new approach will take a sentence-group as its testing field and
calculate the semantic distance between the senses to be disambiguated and the other senses in the
field. The calculation is based mainly on two sets of data in the knowledge resources; in HowNet
these are coded in DEF and in E_E or E_C (the examples). For the disambiguation of most solid
words such as “doctor”, “bank” and “table”, we will not depend on rules in the traditional
patterns. The sharper the contrast between two senses to be disambiguated, the easier the
disambiguation will be.
In 4.1.3 the contrast between the two senses of the word “doctor” is shown by their DEFs. So when
we disambiguate between them, we consult the DEFs of the other solid words in the sentence-group
to see how many features are similar to each of the senses.
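That feature-overlap calculation can be sketched minimally as follows, with toy DEFs taken from 4.1.3. The scoring itself (counting shared features) is an illustrative assumption, not HowNet's published algorithm:

```python
# Sketch of sense disambiguation by DEF feature overlap within a sentence-group.
def features(def_string):
    """Treat a DEF as a bag of features, stripping any pointer prefix."""
    return {item.strip().lstrip("%&?*$@#~") for item in def_string.split(",")}

# The two senses of "doctor", with DEFs as given in 4.1.3.
doctor_senses = {
    "physician": "human|人,*cure|医治,medical|医",
    "PhD":       "human|人,*research|研究,*study|学,education|教育",
}

def pick_sense(senses, context_defs):
    """Choose the sense whose DEF shares the most features with the context."""
    def score(def_string):
        f = features(def_string)
        return sum(len(f & features(c)) for c in context_defs)
    return max(senses, key=lambda name: score(senses[name]))

# DEFs of the other solid words in a sentence-group like
# "... the doctor treated the patient in hospital ...":
context = ["human|人,*SufferFrom|罹患,$cure|医治,#medical|医",          # patient
           "InstitutePlace|场所,@cure|医治,#disease|疾病,medical|医"]   # hospital

print(pick_sense(doctor_senses, context))  # -> physician
```

Here the "physician" sense shares cure|医治 and medical|医 with both context words, while the "PhD" sense shares almost nothing, so the sharp contrast between the two DEFs makes the choice easy, exactly as the text argues.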
One of the advantages of this approach is that its algorithm is language-independent and
system-independent. A sense disambiguation tool built for MT can also be used in other
applications.
5. Conclusion
Translation is really very difficult. It needs not only linguistic knowledge but also
general knowledge. And in a sense it is a kind of art, for it needs the art of manipulating
words. MT, although sometimes mad, gives us some help in technical translation or in
browsing the Internet. The two factors we propose in this paper, i.e. bigger-context
processing and the use of new knowledge resources, may be the key criteria for
next-generation MT technology. Hopefully we will not have a disappointing future.
References
Chang, Jing-shin and Keh-yih Su (1997) Corpus-based Statistics-oriented (CBSO) Machine
Translation Researches in Taiwan. MT Summit VI Proceedings.
Dong, Zhendong (1998) Prospects for Future Machine Translation Research (未来机器翻译研究
的展望). ComputerWorld.
Gerber, Laurie (1997) R&D for Commercial MT. MT Summit VI Proceedings.
Hovy, Eduard and Gerber, Laurie (1997) MT at the Paragraph Level: Improving English
Synthesis in SYSTRAN. TMI '97.
Kay, Martin (1996) Machine Translation: The Disappointing Past and Present. Survey
of the State of the Art in Human Language Technology.
Nagao, Makoto (1997) Machine Translation Through Language Understanding. MT Summit
VI Proceedings.
Proc. of the International Conference on Machine Translation & Computer Language Information
Processing, pp. 17-25, June 1999.