ATRANF: Machine Translation System Prototype (Application on Arabic to French Translation)

Fahima Bouzit, Mohamed Tayeb Laskri
Badji Mokhtar University, Algeria
Bouzit.fahima@gmail.com, laskri@univ-annaba.org

ABSTRACT
This work falls within the framework of natural language processing. Our goal is to develop a machine translation system from Arabic to French using a purely linguistic method. Such a method gives insight into the standard layer-based structure of linguistic phenomena (morphology, syntax and semantics) and into the interaction between these layers, which we believe makes it the most appropriate choice for a language as rich in morphology and syntax as Arabic. To this end, we used a set of linguistic theories: Fillmore's theory, conceptual dependency, the semantic traits of Chafe and the frame representation of Minsky. In this paper, we show the usefulness of these methods and how we combined them to build a multilingual translation system.

Keywords: machine translation, linguistic approach, Arabic, semantic cases, Fillmore theory.

1 INTRODUCTION
Machine translation has attracted the interest of researchers in artificial intelligence for decades, which gave rise to two different currents. The first is based on linguistic theories: the principle is to identify all the rules and features of the language, use them to create dictionaries and rules, and apply a set of formalisms to move from the external representation of the sentence to an internal, universal one. The second uses large amounts of aligned bilingual text to estimate the probabilities of models; this is the statistical or probabilistic current.

In this paper, we describe our approach to developing an Arabic to French machine translation system with a purely linguistic method that combines Fillmore's theory, the semantic features of Chafe and the frame-based representation of Minsky.
In the linguistic approach, one of the most important phases is semantic analysis, which extracts the meaning of surface structures using a variety of tools and methods. To understand the meaning of a sentence, it is essential to know the meaning of its components and the role each of them plays [5]. To this end, we used Fillmore's theory. The idea is to consider the verb as the kernel of the sentence and to study the role of the other constituents (nouns) with respect to this kernel.

2 FILLMORE THEORY
Verbs differ in their typological characteristics: some verbs require the semantic cases 'Agent' and 'Subject', while others require cases such as 'Source' and 'Destination'. Ideally, the cases form a single, small, universal list that is valid in all languages [3]. In Arabic, semantic cases are identified by case marks (short vowels). For example, the case agent is signalled by the nominative grammatical case, marked by the diacritic ' ُ' (damma) or the suffixes 'ان' or 'ون', whereas the case instrument is recognized by the dative case, marked by the diacritic ' ِ' (kasra) or the suffixes 'ـَين' or 'ـِين', and is preceded by the preposition 'بـ' or the words 'باستعمال', 'بواسطة', etc.

The advantage of this method is that it gives the sentence a representation that goes beyond the results of syntactic parsing [2]. In other words, two sentences with different syntactic representations may still carry the same meaning. For example, consider the sentences:
- أرسل الطفل الرسالة الإلكترونية إلى الأستاذ (The child sent an e-mail to the teacher).
- أرسلت الرسالة الإلكترونية إلى الأستاذ من طرف الطفل (An e-mail was sent to the teacher by the child).
The action (verb) is the same, but the grammatical subject differs: الطفل (child) is the subject of the first sentence and الرسالة الإلكترونية (e-mail) is the subject of the second. The semantic cases, however, do not change: the agent is الطفل (child) in both sentences and the object is always الرسالة الإلكترونية (e-mail).

We extracted and specified the Arabic semantic cases based on the characteristics of the language [5], [7]. Here are some examples:

The case AGENT:
  Syntactic case = Subject (grammatical case = Nominative)
The case OBJECT:
  Syntactic case = Object complement
  or Syntactic case = Subject and verb mode = Passive
The case INSTRUMENT:
  Grammatical case = Dative
  Preposition = بواسطة, باستعمال, بـ
The case SOURCE:
  Grammatical case = Dative
  Preposition = من
  or a place noun playing the role of direct object complement of some known verbs, such as غادر and ترك, as in غادر الطفل المدرسة
The case DESTINATION:
  Grammatical case = Dative
  Preposition = لـ, إلى, نحو, صوب
  or a place noun playing the role of direct object complement of some known verbs, such as قصد, as in قصد المسافر الموقع

3 SEMANTIC TRAITS OF CHAFE
This method endows every noun in the definitions dictionary with a set of semantic traits that show the relations it may have with the other words used with it in the sentence. For noun representation, Chafe proposed a classification model: he defined a list of semantic traits (markers) that represent noun properties. According to Chafe, a noun is characterized by the traits Animated, Human, Feminine, Unique, Concrete, Countable and Potent [6], and the traits Consumable and Dimension could be added [4]:
• المستخدم = [(+) Animated, (+) Human, (-) Feminine, (-) Unique, (+) Concrete, (+) Countable, (+) Potent, (-) Consumable, (-) Dimension]
• الشاشة = [(-) Animated, (-) Human, (+) Feminine, (-) Unique, (+) Concrete, (+) Countable, (-) Potent, (-) Consumable, (-) Dimension]

Although this method was developed and used only for nouns, we propose applying it to verbs as well, to compensate for the information that is missing when the user wants to translate a text written without short vowels. For example, if the user wants to translate the sentence يحب مواقع الألعاب الأطفال, the system can identify the agent, which is "الأطفال", through the semantic features of the verb [يحب +human]. This feature means that the action (يحب) can only be performed by a human, so the system checks the features of every noun in the sentence, [مواقع -human] and [الأطفال +human], and concludes that the agent can only be الأطفال.

4 FRAME REPRESENTATION
Once the semantic cases have been extracted, we must find a way to represent them and the relations between them. There is a wide choice of methods and formalisms for this (context-free grammars, recursive transition networks, logic grammars, knowledge-based processing, ...). However, Arabic allows the components of a sentence to be reordered without changing its meaning. For example, the following six sentences are all correct and carry the same meaning:
- طبع الطفل النص بالطابعة بسهولة
- الطفل طبع النص بالطابعة بسهولة
- طبع الطفل بالطابعة بسهولة النص
- بسهولة طبع الطفل النص بالطابعة
- طبع الطفل النص بسهولة بالطابعة
- طبع النص الطفل بسهولة بالطابعة
We therefore cannot use any of those formalisms, because we would face representations and grammars that are huge and difficult to build. This led us to choose the method proposed by Minsky: frames [5]. A frame has a set of slots reserved for the various concepts contained in the sentence to be represented, which leads us to provide a slot for each component that may be encountered (Fig. 1).
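The basic frame can be pictured as a record with one optional slot per semantic case. The following is only a minimal sketch under stated assumptions: the class name `BasicFrame`, the field names (taken from the slot labels of Figure 1) and the helper `empty_slots` are illustrative and are not part of the system described in the paper.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BasicFrame:
    """Hypothetical basic frame: one slot per case, None means the slot is empty."""
    action: Optional[str] = None
    patient: Optional[str] = None
    agent: Optional[str] = None
    obj: Optional[str] = None
    source: Optional[str] = None
    destination: Optional[str] = None
    furnisher: Optional[str] = None
    time: Optional[str] = None
    instrument: Optional[str] = None
    beneficiary: Optional[str] = None
    place: Optional[str] = None
    state: Optional[str] = None
    manner: Optional[str] = None
    purpose: Optional[str] = None

    def empty_slots(self) -> List[str]:
        # Names of the slots that the sentence did not fill.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Whatever the word order of the Arabic sentence, the same slots are filled.
frame = BasicFrame(
    action="طبع",
    agent="الطفل",
    obj="النص",
    instrument="الطابعة",
    manner="سهولة",
    time="الماضي",
)
print(len(frame.empty_slots()))  # 8 of the 14 slots stay empty
```

Because the slots are addressed by name rather than by position, the six word orders listed above would all map to the same filled record.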
  Action: نسخ
  Patient: (empty)
  Agent: الطفل
  Object: النص
  Source: (empty)
  Destination: (empty)
  Furnisher: (empty)
  Time: الماضي
  Instrument: الطابعة
  Beneficiary: (empty)
  Place: (empty)
  State: (empty)
  Manner: سهولة
  Purpose: (empty)
Figure 1: Representation in the basic frame of the sentence نسخ الطفل النص بالطابعة بسهولة

One can easily notice that several slots are empty. This is the drawback of this type of representation: it wastes storage space, because in general each verb has its own characteristics and therefore requires only a reduced number of slots [4]. We know that some verbs require the same slots as others, and that most verbs express ideas that can equally be expressed by other, more basic verbs. This led us to use a method for classifying verbs, and for that reason we chose the theory of conceptual dependency. The sentence نسخ الطفل النص بالطابعة بسهولة (the child easily printed the text with the printer), for instance, will then be represented in a frame with a smaller number of slots (a specialized frame), Fig. 2.

  Action: طبع
  Agent: الطفل
  Object: النص
  Source: (empty)
  Destination: (empty)
  Instrument: الطابعة
  Time: الماضي
  Manner: سهولة
Figure 2: Representation of the sentence نسخ الطفل النص بالطابعة بسهولة in a specialized frame

5 CONCEPTUAL DEPENDENCY
This theory is characterized by the following axioms:
1. Two sentences that have the same meaning, in one language or in two languages, should have the same internal representation, even if their syntactic structures are very different.
2. Any information implied in a sentence must be made explicit in the representation.
3. Any action is expressed in terms of primitives.
Each primitive has an associated diagram which must be instantiated and filled (at least partially) during the understanding process. For example, the primitive of the verb to drink (the INGEST primitive) is the same as for the verbs to eat or to swallow [6]. The theory therefore reduces each set of verbs to one primitive that represents them, so that a common treatment is applied to all these verbs instead of being duplicated for each one.

In our work, we considered the eleven basic actions proposed by Schank [6], Table 1. Two verbs that refer to similar actions share the same primitive. For example, for any verb that denotes the transfer of something abstract (e.g. possession), such as the verbs أرسل, أخذ and سلم, we use the primitive ATRANS. These verbs are thus represented by the same frame; the difference lies in the contents of the Action field. A frame can be implemented as a list, as a table or with objects. We defined a class for each primitive; during the frame construction phase, we instantiate an object of the class to which the verb belongs and fill its fields.

Table 1: The eleven basic primitives proposed by Schank.
  PROPEL: Apply a force to something
  MOVE: Move a body part
  GRASP: Catch an object
  INGEST: Ingest, by an animate being
  EXPEL: Physically expel, by an animate being
  PTRANS: Move a physical object
  ATRANS: Modify an abstract relationship, such as possession
  SPEAK: Produce a sound; support of an action such as "Communicate"
  ATTEND: Direct one's attention to a perception or stimulus
  MTRANS: Transfer information
  MBUILD: Create a new thought

6 DICTIONARIES
To have all the information about the different words needed during the analysis, the dictionary must contain fields or tables that hold any information that could be useful. To ensure this, we initially classified the words of the dictionary into four tables: verbs, nouns, adjectives and particles. For example, the table Verbs contains the stem of the verb (the verb in the past tense, third person masculine singular: هو), its primitive, its various semantic features and its translation, Table 2. During the implementation, however, we noted that there are special cases that must be treated separately.
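To make the role of these tables concrete, the sketch below models a Verbs entry and a Nouns entry and replays the trait check of Section 3 on the sentence يحب مواقع الألعاب الأطفال. It is a minimal illustration under assumptions: the class names, field names, the `agent_traits` field, the French glosses and the function `find_agent` are hypothetical; only the trait values [يحب +human], [مواقع -human] and [الأطفال +human] come from the text. Words are looked up by surface form, which stands in for the morphological analysis.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NounEntry:
    word: str
    traits: Dict[str, bool]      # e.g. {"human": True}
    translation: str             # illustrative French gloss
    adjective: bool = False      # can an adjective be derived from the noun?


@dataclass
class VerbEntry:
    stem: str
    primitive: Optional[str]     # Schank primitive, None when not specified here
    agent_traits: Dict[str, bool]  # traits the agent must carry (assumed field)
    translation: str


# Tiny illustrative dictionary.
NOUNS = {
    "مواقع": NounEntry("مواقع", {"human": False}, "sites"),
    "الأطفال": NounEntry("الأطفال", {"human": True}, "enfants"),
}
VERBS = {
    "يحب": VerbEntry("يحب", None, {"human": True}, "aimer"),
}


def find_agent(verb: str, words: List[str]) -> Optional[str]:
    """Return the only noun whose traits satisfy the verb, or None if ambiguous."""
    required = VERBS[verb].agent_traits
    candidates = [
        w for w in words
        if w in NOUNS
        and all(NOUNS[w].traits.get(t) == v for t, v in required.items())
    ]
    return candidates[0] if len(candidates) == 1 else None


print(find_agent("يحب", ["مواقع", "الألعاب", "الأطفال"]))  # الأطفال
```

Words that are absent from the noun table (here الألعاب) are simply skipped; a fuller version would need a policy for unknown words and for sentences where several nouns satisfy the verb.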
Table 2: Part of the dictionary
  Verbs table columns: Verb | Primitive | Animated | Human | Feminine | ... | Dimension | Translation
  Nouns table columns: Noun | Adjective | Animated | Human | Feminine | ... | Dimension | Translation

Example: consider the words:
  لوحة (panel) = panneau
  تحكم (configuration) = configuration
  مفاتيح (keys) = clés
The word-for-word translation into French gives:
  لوحة التحكم = panneau de configuration
  لوحة المفاتيح = panneau de clés
whereas لوحة المفاتيح in French is: clavier.
The solution we proposed was to put these strings of words in a table called Sequences, and to check during processing (at the frame construction step) whether the text contains one of these sequences, in which case its translation is placed directly in the target frame. The number of tables used by the analysis module and the word translation module is therefore five: verbs, nouns, adjectives, particles and sequences.

Another case worth underlining is the distinction between the manner and the instrument, because both have the grammatical case Dative and are preceded by the preposition 'بـ'. For example: طبع الطفل النص بالطابعة and طبع الطفل النص بسهولة. Both words الطابعة and سهولة are preceded by the preposition 'بـ', but الطابعة is an instrument and سهولة is a manner. Looking at the characteristics of the Arabic language, we found that when the preposition 'بـ' precedes a noun from which an adjective can be derived, the noun describes a manner; otherwise it is an instrument:
  سهولة (we can derive from it the adjectives سهل and سهلة) => it describes a manner;
  الطابعة (we cannot derive an adjective from this noun) => it is an instrument.
This is why we added to the Noun table an entry called Adjective, which states whether an adjective can be derived from the noun, Table 2.

7 SYSTEM ARCHITECTURE
As shown in the diagram of Fig. 3, when a sentence to be translated is entered, the system begins the analysis step. The analysis goes through three phases: a morpho-lexical analysis that aims to recognize each word in the sentence, a syntactic analysis that extracts the various syntactic cases (subject, object, ...), and a semantic analysis that takes the results of the syntactic phase as its input.

[Figure 3: The simplified architecture of the system. The sentence in Arabic -> Analysis -> Frame in Arabic -> Translation -> Frame in French -> Construction -> The sentence in French]

The system recognizes the action by comparing the words of the sentence with the verbs table. It then consults the primitive field (the class of the verb) to extract the type of the verb (ATrans, PTrans, ...), instantiates this class and starts filling the fields. Next, the definition dictionary is consulted to extract the semantic features of each word of the sentence in order to analyze them:
• The subject is recognized by the case mark damma ' ُ ' and by the semantic features of the verb and of the nouns of the sentence.
• The instrument is recognized by the particles or words (بواسطة, بـ).
• The source and the destination are recognized by the particles (من) and (إلى, نحو, صوب) respectively.
• Adjectives and particles can be recognized easily (one table is devoted to the adjectives and another to the particles), etc.
At this stage, we obtain a representation of the sentence that is independent of any language; this is what we call the universal or internal representation. It is the main strength of this approach, because it permits the generation of a translation into any target language: we just need to add a module to support that language.

After the frame of the Arabic sentence (the Arabic frame) is created, the word-for-word translation module produces the destination (target) frame. The system then composes the sentence from the target frame in an order defined beforehand according to the syntax rules of the target language (French):
- Affirmative sentence: Sentence = Agent Action ['de' + Source] ['à' + Destination] ['avec' + Instrument] ...
- Negative sentence: Sentence = Subject + ne (n') + verb + pas + object ...
- Imperative sentence: Sentence = verb (infinitive) + object + ...
The translation produced by the system (based on the DCF approach) thus goes through three phases: analysis, word-for-word translation and generation.

Example: suppose we want to translate the following sentences from Arabic to French, Table 3:
- أرسل الطفل رسالة إلكترونية إلى الأستاذ
- لم يحذف المستعمل الملف من القرص
The analysis of the sentences leads to the construction of the source frames (Arabic frames). Each word in a source frame is then translated word for word to obtain the target frame. Finally, the translation is generated in the target language and the result is reorganized.

Before providing the result, the system organizes the sentence following the rules of syntax and grammar of the target language (French):
- the tense of verbs: present / past / future,
- the gender and number of nouns.
All these treatments (generating the sentence from the target frame and organizing it to comply with the rules of the destination language) are provided by the target language management module. In French, verbs are divided into three groups: two groups for regular verbs and a third for irregular ones. For regular verbs, a few rules suffice to put the verb in the requested tense. For irregular verbs, we chose to store in a dedicated table all the forms each verb can take in the different tenses.
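The sentence patterns and the final reorganization step of this section can be illustrated with a small sketch that reproduces the two French outputs listed in Table 3. Everything in it is an assumption made for illustration: the frame is a plain dictionary, the auxiliary and past participle are supplied ready-made (standing in for the conjugation tables), and `tidy` covers only the elision and contraction cases that occur in these two examples.

```python
import re
from typing import Dict


def tidy(text: str) -> str:
    """Apply a few French surface rules (illustrative, far from complete)."""
    # Elision: le/la/ne/de + vowel-initial word -> l'/n'/d'
    text = re.sub(r"\b([lnd])[ea] (?=[aeiouéèêàâîôû])", r"\1'", text,
                  flags=re.IGNORECASE)
    # Contraction of preposition + article that survived elision
    text = re.sub(r"\bde le\b", "du", text)
    text = re.sub(r"\bà le\b", "au", text)
    return text


def generate(frame: Dict[str, str], negative: bool = False) -> str:
    """Linearize a target frame: Agent Action Object [de Source] [à Destination] ..."""
    if negative:
        verb = ["ne", frame["aux"], "pas", frame["participle"]]
    else:
        verb = [frame["aux"], frame["participle"]]
    parts = [frame["agent"], *verb, frame["object"]]
    for prep, slot in (("de", "source"), ("à", "destination"), ("avec", "instrument")):
        if frame.get(slot):
            parts += [prep, frame[slot]]
    return tidy(" ".join(parts))


sent1 = generate({
    "agent": "Le enfant", "aux": "a", "participle": "envoyé",
    "object": "un courrier électronique", "destination": "le enseignant",
})
sent2 = generate({
    "agent": "Le utilisateur", "aux": "a", "participle": "supprimé",
    "object": "le fichier", "source": "le disque",
}, negative=True)

print(sent1)  # L'enfant a envoyé un courrier électronique à l'enseignant
print(sent2)  # L'utilisateur n'a pas supprimé le fichier du disque
```

Elision is applied before contraction so that "à le enseignant" becomes "à l'enseignant" rather than "au enseignant". Plural forms (des, aux) and words beginning with h are deliberately left out of this sketch.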
In addition, every French verb takes an auxiliary, "être" or "avoir", in compound tenses, and this auxiliary has to be specified in the verb table. For nouns and adjectives, the tables store the masculine singular, feminine singular, masculine plural and feminine plural forms.
Examples:
- beau / belle / beaux / belles
- Le livre est vendu
- La revue est vendue
- Les livres sont vendus
- Les revues sont vendues

Table 3: Examples of translations

Sentence 1: أرسل الطفل رسالة إلكترونية إلى الأستاذ
  Slot | Arabic frame | French frame
  Action | أرسل | envoyer
  Agent | الطفل | Le enfant
  Object | رسالة إلكترونية | Courrier électronique
  Destination | الأستاذ | Le enseignant
  Tense | past | past
  Result: L'enfant a envoyé un courrier électronique à l'enseignant

Sentence 2: لم يحذف المستعمل الملف من القرص
  Slot | Arabic frame | French frame
  Action | لم يحذف | ne pas supprimer
  Agent | المستعمل | Le utilisateur
  Object | الملف | le fichier
  Source | القرص | le disque
  Tense | past | past
  Before reorganization: Le utilisateur ne pas a supprimé le fichier de le disque
  Result: L'utilisateur n'a pas supprimé le fichier du disque

8 CONCLUSION & PERSPECTIVES
Our system, some modules of which were presented in this article, performs the semantic processing of texts with a purely linguistic approach built on the DCF method. This method has proved highly adaptable to the Arabic language and to its syntactic and semantic peculiarities [1], [4], [5], [6].
As prospects for this work, we plan to integrate a good morphological analyzer into the system (such as the tool Aramorph) and to enrich the dictionaries so as to cover other application areas and improve the results: the richer the dictionaries and the better defined the rules, the more accurate the resulting translation. We also aim to implement modules for other target languages in order to build a multilingual translation system.

9 REFERENCES
[1] K. Rezeg: Une Approche Connexionniste pour la Traduction Automatique des Textes Arabe en Français, Courrier du Savoir, N° 08, pp. 59-67 (2007).
[2] S. Russell et al.: Artificial Intelligence with 400 Exercises, Pearson Edition (2005).
[3] K. Meftouh: Extraction Automatique du Sens d’Une Phrase en Langue Française par une Approche Neuronale, JADT (2002).
[4] K. Meftouh: Un Réseau Simplement Récurrent pour la Génération d’une Représentation du Sens d’une Phrase Ecrite en Langue Arabe, Magister (2002).
[5] R. Mahdjoubi: Un Système pour le Traitement Automatique de la Langue Arabe Basé sur ces Propres Caractéristiques, Magister (1994).
[6] M. T. Laskri: Une Sémantique du Langage Naturel à Travers un Système Support de Thésaurus, Doctorat d’Etat (1994).