ATRANF: Machine Translation System Prototype
(Application on Arabic to French Translation)
Fahima Bouzit, Mohamed Tayeb Laskri
Badji Mokhtar University, Algeria
Bouzit.fahima@gmail.com, laskri@univ-annaba.org
ABSTRACT
This work falls within the framework of natural language processing. Our goal is
to develop a machine translation system from Arabic to French using a purely
linguistic method, which gives insight into the standard layered structure of
linguistic phenomena (morphology, syntax and semantics) and into the
interactions between them. We consider this approach the most appropriate for a
language as rich in morphology and syntax as Arabic. To this end, we used a set
of linguistic theories: Fillmore theory, conceptual dependency, the semantic
traits of Chafe and the frame representation of Minsky. In this paper we show
the usefulness of these methods and how we combined them to build a
multilingual translation system.
Keywords: machine translation, linguistic approach, Arabic, semantic cases,
Fillmore theory.
1 INTRODUCTION
Machine translation has attracted the interest of artificial intelligence
researchers for decades, which gave rise to two different currents. The first
is based on linguistic theories: its principle is to identify the rules and
features of the language, use them to build dictionaries and rule sets, and
apply a set of formalisms to move from the external representation of the
sentence to a universal internal one. The second uses large amounts of aligned
bilingual text to estimate the probabilities of models; this is the statistical
or probabilistic current.
In this paper, we describe our approach to developing an Arabic to French
machine translation system using a purely linguistic approach that combines
several methods: Fillmore theory, the semantic features of Chafe and the
frame-based representation of Minsky.
In the linguistic approach, one of the most important phases is the semantic
analysis, which extracts the meaning of surface structures using a variety of
tools and methods.
To understand the meaning of a sentence, it is essential to know the meaning of
its various components and the role of each of them [5]. To ensure this, we
used Fillmore theory. The idea is to consider the verb as the kernel of the
sentence and to study the role that each of the other constituents (nouns)
plays with respect to this kernel.
2 FILLMORE THEORY
Verbs differ according to their typological characteristics. For example, some
verbs require the semantic cases 'Agent' and 'Subject', while other verbs
require other cases, such as 'Source' and 'Destination'. Ideally, the cases
form a single, small, universal list that is valid in all languages [3]. In
Arabic, semantic cases are identified by case marks (short vowels). For
example, the case agent is marked by the nominative grammatical case, which is
indicated by the diacritic ' ُ ' (damma) or the suffixes 'ان' or 'ون'. The case
instrument is recognized by the dative case, indicated by the diacritic ' ِ '
(kasra) or the suffix 'ين', and is preceded by the preposition 'بـ' or the
words 'باستعمال', 'بواسطة', etc.
The advantage of this method is that it gives the sentence a representation
that goes beyond the output of syntactic parsing [2]. In other words, even if
two sentences have different syntactic representations, they may carry the same
meaning. For example, in the sentences:
- أرسل الطفل الرسالة الإلكترونية إلى الأستاذ (The child sent an e-mail to the teacher).
- أرسلت الرسالة الإلكترونية إلى الأستاذ من طرف الطفل (An e-mail was sent to the teacher by the child).
the syntactic subject differs although the action (verb) is the same: الطفل
(child) is the subject of the first sentence and الرسالة الإلكترونية (e-mail)
is the subject of the second. The agent, however, is الطفل (child) in both
cases, and the object is always الرسالة الإلكترونية (e-mail).
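The two sentences above can be used to illustrate how different syntactic analyses collapse into one set of semantic cases. The sketch below is only an illustration under stated assumptions: the `Parse` record, its field names and the `semantic_cases` function are hypothetical, and the verb is assumed to have already been reduced to its stem by the morphological analysis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Parse:
    """Hypothetical output of the syntactic analysis for one clause."""
    verb: str                                # verb stem (assumed already lemmatized)
    passive: bool                            # True if the verb is in the passive mode
    subject: str                             # syntactic subject
    object_complement: Optional[str] = None  # direct object, if any
    by_phrase: Optional[str] = None          # noun introduced by من طرف, if any


def semantic_cases(parse: Parse) -> dict:
    """Map syntactic roles to the AGENT and OBJECT semantic cases."""
    if parse.passive:
        # Passive verb: the syntactic subject fills the OBJECT case and the
        # agent, when expressed, comes from the by-phrase.
        return {"action": parse.verb, "agent": parse.by_phrase,
                "object": parse.subject}
    # Active verb: the subject is the agent, the object complement the object.
    return {"action": parse.verb, "agent": parse.subject,
            "object": parse.object_complement}


active = Parse(verb="أرسل", passive=False, subject="الطفل",
               object_complement="الرسالة الإلكترونية")
passive = Parse(verb="أرسل", passive=True, subject="الرسالة الإلكترونية",
                by_phrase="الطفل")

# Both analyses yield the same semantic cases.
print(semantic_cases(active) == semantic_cases(passive))
```

Although the subject differs between the two parses, the comparison holds because the agent and object slots are filled from the semantic role rather than from the syntactic position.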
We extracted and specified the Arabic semantic cases based on the
characteristics of the language [5], [7]. Here are some examples:
• The case AGENT:
  Syntactic Case = Subject
  (Grammatical Case = Nominative)
• The case OBJECT:
  Syntactic Case = Object Complement
  Or
  Syntactic Case = Subject and Verb Mode = Passive
• The case INSTRUMENT:
  Grammatical Case = Dative
  Preposition = بـ, باستعمال, بواسطة
• The case SOURCE:
  Grammatical Case = Dative
  Preposition = من
  Or
  A place noun playing the role of a direct object complement of some known
  verbs, such as غادر, ترك, as in غادر الطفل المدرسة
• The case DESTINATION:
  Grammatical Case = Dative
  Preposition = لـ, إلى, نحو, صوب
  Or
  A place noun playing the role of a direct object complement of some known
  verbs, such as قصد, as in قصد المسافر الموقع
3 SEMANTIC TRAITS OF CHAFE
This method consists of endowing every noun in the definitions dictionary with
a set of semantic traits that show the relations it may have with the other
words used with it in the sentence.
For noun representation, Chafe proposed a classification model. He defined a
list of semantic traits (markers) that represent noun properties.
According to Chafe, the noun is characterized by the traits Animated, Human,
Feminine, Unique, Concrete, Countable and Potent [6]; the traits Consumable and
Dimension can be added [4]:
• المستخدم = [(+) Animated, (+) Human, (-) Feminine, (-) Unique, (+) Concrete,
  (+) Countable, (+) Potent, (-) Consumable, (-) Dimension]
• الشاشة = [(-) Animated, (-) Human, (+) Feminine, (-) Unique, (+) Concrete,
  (+) Countable, (-) Potent, (-) Consumable, (-) Dimension]
Although this method was developed and used only for nouns, we propose to apply
it to verbs as well, to solve the lack of information that occurs when the user
wants to translate a text without short vowels. If the user wants to translate
the sentence يحب مواقع الألعاب الأطفال, the system can identify the agent,
الأطفال, through the semantic feature of the verb يحب [+ human], which means
that this action (يحب) can be performed only by a human. The system therefore
checks the features of every noun in the sentence, مواقع [- human] and
الأطفال [+ human], and decides that the agent can only be الأطفال.
4 FRAME REPRESENTATION
Once the semantic cases are extracted, we must find a way to represent them and
the relations between them. There is a wide choice of methods and formalisms
for this (context-free grammars, recursive transition networks, logic grammars,
knowledge-based processing, …). However, Arabic has a characteristic that
matters here: the components of a sentence can be reordered without changing
its meaning. For example, the following six sentences are all correct and carry
the same meaning:
- طبع الطفل النص بالطابعة بسهولة
- الطفل طبع النص بالطابعة بسهولة
- طبع الطفل بالطابعة بسهولة النص
- بسهولة طبع الطفل النص بالطابعة
- طبع الطفل النص بسهولة بالطابعة
- طبع النص الطفل بسهولة بالطابعة
So we cannot use any of those formalisms, because we would face huge
representations and grammars that are difficult to build. This leads us to
choose the method proposed by Minsky: frames [5].
A frame has a set of slots reserved for the various concepts contained in the
sentence to be represented, which leads us to provide a slot for each component
that may be encountered (Fig. 1).
Slot | Value
Action | نسخ
Patient | (empty)
Agent | الطفل
Object | النص
Source | (empty)
Destination | (empty)
Furnisher | (empty)
Time | الماضي
Instrument | الطابعة
Beneficiary | (empty)
Place | (empty)
State | (empty)
Manner | سهولة
Purpose | (empty)
Figure 1: Representation in the basic frame of the sentence
نسخ الطفل النص بالطابعة بسهولة
One can easily notice that several slots are empty. This is the drawback of
this type of representation: it wastes storage space, because in general each
verb has its own characteristics and therefore requires only a reduced number
of slots [4].
We know that some verbs require the same slots as others, and that most verbs
express ideas that can equally be expressed by other, more basic verbs. This
leads us to use a method of verb classification. For that reason, we chose the
theory of conceptual dependency. The sentence نسخ الطفل النص بالطابعة بسهولة
(the child easily printed the text with the printer), for instance, is then
represented in a frame with fewer slots (a specialized frame), Fig. 2.
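The difference between the basic frame and a specialized frame can be sketched with two small classes. This is an illustrative sketch rather than the system's actual implementation: the class names, the field names and the `empty_slots` helper are hypothetical, and the slot lists are the ones shown in Fig. 1 and Fig. 2.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BasicFrame:
    """Basic frame: one slot for every component that may be encountered."""
    action: str
    patient: Optional[str] = None
    agent: Optional[str] = None
    object_: Optional[str] = None
    source: Optional[str] = None
    destination: Optional[str] = None
    furnisher: Optional[str] = None
    time: Optional[str] = None
    instrument: Optional[str] = None
    beneficiary: Optional[str] = None
    place: Optional[str] = None
    state: Optional[str] = None
    manner: Optional[str] = None
    purpose: Optional[str] = None


@dataclass
class SpecializedFrame:
    """Specialized frame: only the slots needed by one family of verbs."""
    action: str
    agent: Optional[str] = None
    object_: Optional[str] = None
    source: Optional[str] = None
    destination: Optional[str] = None
    instrument: Optional[str] = None
    time: Optional[str] = None
    manner: Optional[str] = None


def empty_slots(frame) -> List[str]:
    """Return the names of the slots that were left unfilled."""
    return [f.name for f in fields(frame) if getattr(frame, f.name) is None]


# Slot values for the sentence نسخ الطفل النص بالطابعة بسهولة
values = dict(action="نسخ", agent="الطفل", object_="النص",
              instrument="الطابعة", time="الماضي", manner="سهولة")

basic = BasicFrame(**values)
special = SpecializedFrame(**values)

print(len(empty_slots(basic)))   # 8 of the 14 slots stay empty
print(empty_slots(special))      # ['source', 'destination']
```

With the same six values, the basic frame leaves eight slots unused while the specialized frame leaves two, which is the storage saving the text describes.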
Slot | Value
Action | طبع
Agent | الطفل
Object | النص
Source | (empty)
Destination | (empty)
Instrument | الطابعة
Time | الماضي
Manner | سهولة
Figure 2: Representation of the sentence نسخ الطفل النص بالطابعة بسهولة in a
specialized frame
5 CONCEPTUAL DEPENDENCY
This theory is characterized by the following axioms:
1. Two sentences that have the same meaning, in one language or in two
languages, should have the same internal representation, even if their
syntactic structures are very different.
2. Any information implied in a sentence must be made explicit in the
representation.
3. Any action is expressed in terms of primitives.
Each primitive has an associated diagram which must be instantiated and filled
(at least partially) during the understanding process. For example, the
primitive of the verb to drink (the INGEST primitive) is the same as for the
verbs to eat or to swallow [6].
This theory therefore allows each set of verbs to be reduced to a primitive
that represents it, so that a common treatment is applied to all these verbs
instead of being duplicated for each one.
In our work, we considered the eleven basic actions proposed by Schank [6],
Table 1.
Table 1: The eleven basic primitives proposed by Schank.
Primitive | Meaning
PROPEL | Apply a force to something
MOVE | Move a body part
GRASP | Catch an object
INGEST | Ingest, for an animate object
EXPEL | Physically expel, for an animate object
PTRANS | Move a physical object
ATRANS | Modify an abstract relationship, such as possession
SPEAK | Produce a sound; supports an action such as "communicate"
ATTEND | Direct attention to a perception or stimulus
MTRANS | Transfer information
MBUILD | Create a new thought
So for two verbs that refer to similar actions, we use the same primitive. For
example, for any verb that denotes a transfer of something abstract (e.g.
possession), such as the verbs أخذ, سلم, أرسل, we use the primitive ATRANS.
These verbs thus share the same frame; the difference lies in the content of
the Action field.
We can implement the frame as a list, a table, or using objects. We defined a
class for each primitive; during the frame construction phase, we instantiate
an object of the class to which the verb belongs and fill its fields.
6 DICTIONARIES
To have all the information about the words needed during the analysis, the
dictionary must contain fields (or tables) holding any information that could
be useful.
To ensure this, we initially organized the words in the dictionary into four
tables: verbs, nouns, adjectives and particles. For example, the Verbs table
contains the stem of the verb (the verb in the past tense, third person
masculine singular: هو), its primitive, its various semantic features and its
translation (Table 2).
During the implementation, however, we noted that there are special cases that
must be treated separately.
Table 2: Part of the dictionary (column headers)
Verbs table: Verb | Primitive | Animated | Human | Feminine | ... | Dimension | Translation
Nouns table: Noun | Adjective | Animated | Human | Feminine | ... | Dimension | Translation
Example: consider the words:
لوحة (panel) = panneau
تحكم (configuration) = configuration
مفاتيح (keys) = clés
The word-for-word translation into French gives:
لوحة التحكم = panneau de configuration;
لوحة المفاتيح = panneau de clés;
whereas the correct French for لوحة المفاتيح is: clavier.
The solution we propose is to put such word sequences in a table called
Sequences and to check during processing (at the frame construction step)
whether the text contains one of these sequences, in which case we put its
translation directly in the target frame.
The number of tables used by the analysis module and the word translation
module is therefore five: verbs, nouns, adjectives, particles and sequences.
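The Sequences lookup can be sketched as a longest-match search performed before falling back to the word tables. The sketch below is a minimal illustration, not the system's code: the names `WORDS`, `SEQUENCES` and `translate_tokens` are hypothetical, the entries are the ones from the example, and the insertion of "de" between translated words is assumed to be left to the generation step.

```python
# Toy word table and Sequences table built from the example in the text.
WORDS = {"لوحة": "panneau", "التحكم": "configuration", "المفاتيح": "clés"}
SEQUENCES = {("لوحة", "المفاتيح"): "clavier"}


def translate_tokens(tokens):
    """Translate a token list, trying multi-word sequences before single words."""
    out = []
    i = 0
    max_len = max(len(seq) for seq in SEQUENCES)
    while i < len(tokens):
        # Try the longest candidate sequence first, down to length 2.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in SEQUENCES:
                out.append(SEQUENCES[candidate])
                i += n
                break
        else:
            # No sequence starts here: fall back to the word tables and
            # keep the token unchanged if it is unknown.
            out.append(WORDS.get(tokens[i], tokens[i]))
            i += 1
    return out


print(translate_tokens(["لوحة", "المفاتيح"]))  # ['clavier']
print(translate_tokens(["لوحة", "التحكم"]))    # ['panneau', 'configuration']
```

Checking sequences first ensures that an idiomatic unit is never split into its literal parts, while ordinary noun phrases still go through the word tables.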
Another case worth underlining is the distinction between the manner and the
instrument, because both have the grammatical case Dative and are preceded by
the preposition 'بـ'. For example:
طبع الطفل النص بالطابعة and طبع الطفل النص بسهولة
Both words الطابعة and سهولة are preceded by the preposition 'بـ', but الطابعة
is an instrument and سهولة is a manner. Looking at the characteristics of the
Arabic language, we found that when the preposition 'بـ' precedes a noun from
which an adjective can be derived, the noun describes a manner; otherwise it is
an instrument:
سهولة (we can derive from it the adjectives سهل and سهلة) => it describes a
manner;
الطابعة (we cannot derive an adjective from this noun) => it is an instrument.
This is why we added to the Noun table a field called Adjective, which
indicates whether an adjective can be derived from the noun (Table 2).
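The manner/instrument rule reduces to a single lookup in the Noun table. The following sketch assumes a toy table with only the Adjective field; the names `NOUNS` and `case_after_bi` are hypothetical and chosen for illustration.

```python
# Toy Noun table: the Adjective field records whether an adjective can be
# derived from the noun (only the two nouns of the example are listed).
NOUNS = {
    "سهولة": {"adjective": True},     # سهل / سهلة can be derived
    "الطابعة": {"adjective": False},  # no derivable adjective
}


def case_after_bi(noun: str) -> str:
    """Return the semantic case of a noun introduced by the preposition بـ."""
    entry = NOUNS.get(noun)
    if entry is None:
        raise KeyError(f"noun not in dictionary: {noun}")
    return "MANNER" if entry["adjective"] else "INSTRUMENT"


print(case_after_bi("سهولة"))    # MANNER
print(case_after_bi("الطابعة"))  # INSTRUMENT
```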
7 SYSTEM ARCHITECTURE
As shown in the diagram (Fig. 3), when a sentence is submitted for translation,
the system begins with the analysis step. The analysis goes through three
phases: a morpho-lexical analysis that aims to recognize each word in the
sentence, a syntactic analysis that extracts the various syntactic cases
(subject, object, …), and a semantic analysis that takes the results of the
syntactic phase as input.
Figure 3: The simplified architecture of the system (Sentence in Arabic ->
Analysis -> Frame in Arabic -> Translation -> Frame in French -> Construction
-> Sentence in French)
The system recognizes the action by comparing the words of the sentence with
the verbs table, then consults the primitive field (the class of the verb) to
extract the type of verb (ATRANS, PTRANS, ...), instantiates this class and
starts filling its fields. Then, the definition dictionary is consulted to
extract the semantic features of each word of the sentence in order to analyze
them:
• The subject, for example, is recognized by the case mark damma ' ُ ' and by
the semantic features of the verb and the nouns of the sentence.
• The instrument is recognized by the particles or words (بـ, بواسطة).
• The source and destination are recognized by the particles (من) and
(إلى, نحو, صوب) respectively.
• Adjectives and particles are recognized easily (one table is devoted to
adjectives and another to particles), etc.
At this stage, we obtain a representation of the sentence that is independent
of any language; this is what we call the universal or internal representation.
It is the strong point of this approach, because it permits the generation of a
translation into any target language: we just need to add a module to support
it.
After the frame of the Arabic sentence (Arabic frame) is created, the
word-for-word translation module produces the destination (target) frame. Then,
the system builds the sentence from the target frame in an order defined in
advance according to the syntax rules of the target language (French).
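The recognition rules used during the semantic analysis (particles first, then semantic features when short vowels are missing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the function `assign_cases`, the feature dictionaries and the input format (a list of `(particle, noun)` pairs produced by the earlier phases) are hypothetical, only the `human` feature is modelled, and the attached preposition بـ is left out because it needs the separate manner/instrument rule.

```python
# Particles that introduce each case, as listed in the text.
SOURCE_PARTICLES = {"من"}
DESTINATION_PARTICLES = {"إلى", "نحو", "صوب"}
INSTRUMENT_PARTICLES = {"بواسطة"}

# Toy feature dictionaries (assumed entries, only the 'human' trait).
VERB_FEATURES = {"يحب": {"human": True}, "حذف": {"human": True}}
NOUN_FEATURES = {
    "مواقع": {"human": False},
    "الأطفال": {"human": True},
    "المستعمل": {"human": True},
    "الملف": {"human": False},
}


def assign_cases(verb, constituents):
    """Fill a frame from (particle, noun) pairs; particle is None for bare nouns."""
    frame = {"action": verb}
    bare = []
    for particle, noun in constituents:
        if particle in SOURCE_PARTICLES:
            frame["source"] = noun
        elif particle in DESTINATION_PARTICLES:
            frame["destination"] = noun
        elif particle in INSTRUMENT_PARTICLES:
            frame["instrument"] = noun
        else:
            bare.append(noun)
    # Agent: the first bare noun whose traits satisfy what the verb requires.
    required = VERB_FEATURES.get(verb, {})
    for noun in bare:
        traits = NOUN_FEATURES.get(noun, {})
        if all(traits.get(name) == value for name, value in required.items()):
            frame["agent"] = noun
            break
    # Object: the first remaining bare noun, if any.
    rest = [noun for noun in bare if noun != frame.get("agent")]
    if rest:
        frame["object"] = rest[0]
    return frame


# Unvowelled sentence where word order alone would pick the wrong agent.
print(assign_cases("يحب", [(None, "مواقع"), (None, "الأطفال")]))
print(assign_cases("حذف", [(None, "المستعمل"), (None, "الملف"), ("من", "القرص")]))
```

In the first call the agent is chosen by its `human` trait even though it appears last, which mirrors the disambiguation the paper describes for texts without short vowels.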
We note that the translation produced by the system (based on the DCF approach)
goes through three phases: analysis, word-for-word translation and generation.
The generation phase uses one sentence pattern per sentence type:
- Affirmative Sentence:
Sentence = Agent + Action [ 'de' + Source] [ 'à' + Destination] [ 'avec' +
Instrument] ...
- Negative Sentence:
Sentence = Subject + ne (n') + verb + pas + object …
- Imperative Sentence:
Sentence = verb (infinitive) + object + …
Example: suppose we want to translate the following sentences from Arabic to
French (Table 3):
- أرسل الطفل رسالة إلكترونية إلى الأستاذ and
- لم يحذف المستعمل الملف من القرص
The analysis of the sentences leads to the construction of the source frames
(Arabic frames).
Each word in the source frames is then translated word for word to obtain the
target frames.
Finally, we generate the translation into the target language and reorganize
the result.
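The affirmative and negative patterns can be sketched as a function that linearizes a target frame. The sketch is illustrative only: the function name `generate`, the frame keys and the convention that the Action slot already holds a conjugated verb (auxiliary first in compound tenses) are assumptions, and the output is a raw sentence that still needs the reorganization step (elision, contraction, capitalization).

```python
def generate(frame: dict, negative: bool = False) -> str:
    """Linearize a French target frame with the affirmative or negative pattern."""
    verb = frame["action"].split()
    if negative:
        # ne + conjugated form + pas; in a compound tense the conjugated
        # form is the auxiliary, so the participle follows 'pas'.
        verb = ["ne", verb[0], "pas", *verb[1:]]
    parts = [frame["agent"], *verb]
    if "object" in frame:
        parts.append(frame["object"])
    # Optional slots, each introduced by its preposition.
    for slot, preposition in (("source", "de"),
                              ("destination", "à"),
                              ("instrument", "avec")):
        if slot in frame:
            parts += [preposition, frame[slot]]
    return " ".join(parts)


affirmative = {"agent": "le enfant", "action": "a envoyé",
               "object": "un courrier électronique",
               "destination": "le enseignant"}
negation = {"agent": "le utilisateur", "action": "a supprimé",
            "object": "le fichier", "source": "le disque"}

print(generate(affirmative))
# le enfant a envoyé un courrier électronique à le enseignant
print(generate(negation, negative=True))
# le utilisateur ne a pas supprimé le fichier de le disque
```

Because the slots are read by name, the word order of the Arabic sentence has no influence on the French output; only the pattern decides the order.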
Before providing the result, the system organizes the sentence following the
syntax and grammar rules of the target language (French):
- the tense of verbs: present / past / future;
- the gender and number of nouns.
All these operations (generating the sentence from the target frame and
arranging it to comply with the rules of the target language) are provided by
the target language management module.
In French, verbs are divided into three groups: two groups of regular verbs and
a third of irregular ones. For regular verbs, a few rules suffice to put the
verb in the required tense; for irregular verbs, we chose to store in a
dedicated table all the forms each one can take in the different tenses. In
addition, every French verb has an appropriate auxiliary, "être" or "avoir"
(used in compound tenses), which has to be specified in the verb table.
For nouns, the tables store the masculine singular, feminine singular,
masculine plural and feminine plural forms.
Examples:
- Le livre est vendu / beau
- La revue est vendue / belle
- Les revues sont vendues / belles
- Les livres sont vendus / beaux
8 CONCLUSION & PERSPECTIVES
The system, some modules of which were presented in this article, thus performs
semantic processing of texts using purely linguistic methods, with the DCF
method as its basis. This method has proved highly adaptable to the Arabic
language and its syntactic and semantic peculiarities [1], [4], [5], [6].
As prospects for this work, we plan to integrate a good morphological analyzer
into the system (such as the tool Aramorph) and to enrich the dictionaries to
cover other application areas and improve the results: the richer the
dictionaries and the better defined the rules, the more accurate the resulting
translation.
We also aim to implement modules for other target languages in order to build a
multilingual translation system.
Table 3: Examples of translations
First sentence:
Slot | Arabic frame | French frame
Tense | past | past
Destination | الأستاذ | Le enseignant
Object | رسالة إلكترونية | Courrier électronique
Agent | الطفل | Le enfant
Action | أرسل | envoyer
Result: L’enfant a envoyé un courrier électronique à l’enseignant
Second sentence:
Slot | Arabic frame | French frame
Tense | past | past
Source | القرص | le disque
Object | الملف | le fichier
Agent | المستعمل | Le utilisateur
Action | لم يحذف | ne pas supprimer
Before reorganization: Le utilisateur ne pas a supprimé le fichier de le disque
After reorganization: L’utilisateur n’a pas supprimé le fichier du disque
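The reorganization step illustrated in Table 3 (negation placement, elision and article contraction) can be approximated with a few regular expressions. This is a hedged sketch and not the system's target language module: the function name `reorganize` is hypothetical, only the articles and particles seen in the examples are handled, and words starting with a mute "h" are not covered.

```python
import re

VOWELS = "aeiouàâéèêëîïôûù"


def reorganize(sentence: str) -> str:
    """Apply negation placement, elision and contraction to a raw sentence."""
    # Negation: 'ne pas a supprimé' -> 'ne a pas supprimé'.
    sentence = re.sub(r"\bne pas (\w+)", r"ne \1 pas", sentence)
    # Elision: le/la/ne/de lose their vowel before a vowel-initial word.
    sentence = re.sub(
        rf"\b(le|la|ne|de) (?=[{VOWELS}])",
        lambda m: m.group(1)[:-1] + "'",
        sentence,
        flags=re.IGNORECASE,
    )
    # Contraction of prepositions with the remaining definite articles.
    for raw, contracted in (("de les", "des"), ("à les", "aux"),
                            ("de le", "du"), ("à le", "au")):
        sentence = re.sub(rf"\b{raw}\b", contracted, sentence)
    return sentence


print(reorganize("Le utilisateur ne pas a supprimé le fichier de le disque"))
# L'utilisateur n'a pas supprimé le fichier du disque
print(reorganize("Le enfant a envoyé un courrier électronique à le enseignant"))
# L'enfant a envoyé un courrier électronique à l'enseignant
```

Elision is applied before contraction so that "à le enseignant" becomes "à l'enseignant" rather than "au enseignant".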
9 REFERENCES
[1] K. Rezeg: Une Approche Connexionniste pour la Traduction Automatique des
Textes Arabe en Français, Courrier du Savoir, N° 08, pp. 59-67 (2007).
[2] S. Russell et al.: Artificial Intelligence with 400 Exercises, Pearson
Edition (2005).
[3] K. Meftouh: Extraction Automatique du Sens d’Une Phrase en Langue Française
par une Approche Neuronale, JADT (2002).
[4] K. Meftouh: Un Réseau Simplement Récurrent pour la Génération d’une
Représentation du Sens d’une Phrase Ecrite en Langue Arabe, Magister (2002).
[5] R. Mahdjoubi: Un Système pour le Traitement Automatique de la Langue Arabe
Basé sur ces Propres Caractéristiques, Magister (1994).
[6] M. T. Laskri: Une Sémantique du Langage Naturel à Travers un Système
Support de Thésaurus, Doctorat d’Etat (1994).