A Rule-Based Morphological Analyzer of Arabic Words

ARAFAT AWAJAN
Computer Science Department
Princess Sumaya University College for Technology
Royal Scientific Society
Amman - JORDAN

Abstract: - This paper describes a rule-based technique for analyzing the morphology of Arabic words. The proposed ‘Morphological Analyzer’ processes the input word in order to determine its lexical form. The lexical form of the majority of Arabic words consists of a root and a morphological pattern. The analyzer applies a set of predefined rules in order to analyze the morphology of Arabic words as they appear in real text. It is able to recognize diacriticized, undiacriticized or partially diacriticized Arabic words generated from N-letter roots. In order to determine the possible meanings of a word, the Morphological Analyzer also provides some useful attributes of the word such as its type, gender, tense and number. The proposed Morphological Analyzer is a general-purpose technique that can be integrated into larger scale systems such as automatic translation applications, text summarization applications, text correction applications, web search engines, automatic vowelization of Arabic text applications and other natural language processing applications.

Keywords: - Natural Language Processing, Arabic Word Recognition, Lexical Form, Roots, Morphological Patterns, Morphological Analyzer.

1. Introduction

For the Arabic language, as for many other languages, the morphological features of a word provide crucial information for text understanding and information extraction. In fact, the possible meanings of individual words depend mainly on their morphology and their position in a sentence. Therefore, the possible meanings of a word must be determined first in order to understand text written in a natural language.

A number of research papers concerning the morphological analysis of words have been published for various natural languages, particularly the European languages [2] [5] [8] [6] [9]. Descriptions of real systems for analyzing the morphology of these languages are also available [7]. These works show that the complexity of the morphological analysis of words varies from one natural language to another. Fewer articles and research papers have been published on the morphological analysis of Arabic words. Most of these published works ignore the presence of diacritics in Arabic text or limit the analysis to words generated from 3-letter roots [1] [4] [3].

The rule-based Morphological Analyzer presented in this paper aims to find the lexical form and the possible meanings of each word in a text written in Arabic. The proposed analyzer is being developed in order to analyze Arabic words as they appear in real text. It can be applied to diacriticized, undiacriticized or partially diacriticized Arabic words. Furthermore, it allows the morphological analysis of words generated from roots of variable length. Our approach is based on the specific features and structures that the Arabic language uses for generating words. It applies a set of predefined rules specific to the Arabic language in order to extract the lexical structure of the word, which generally consists of a root and a morphological pattern. A classical lexicon is used to verify the correctness of the analyzed word and to determine the meanings it could take. The morphological information that our technique is able to extract gives vital support to the different fields and applications of Natural Language Processing.
The purpose of the Morphological Analyzer is to preprocess Arabic text in order to prepare it for automated treatment such as human-machine interaction, translation, text summarization, text correction, automatic vowelization of Arabic text, web search engines and other applications of natural language processing.

2. The Morphology of Arabic Words

In many European languages, words are constructed from basic units called morphemes by adding suffixes and prefixes. A morpheme is the primitive unit of meaning in a language. For example, the meaning of the English word ‘friendly’ is derivable from the meaning of the noun ‘friend’ and the suffix ‘-ly’, which transforms a noun into an adjective [3]. In such cases, morphological analysis is based on the elimination of affixes and the extraction of the basic morpheme of a word. Special treatment is always considered to deal with the irregularities present in almost all natural languages.

The morphology of the Arabic language is based on the Semitic root-and-pattern scheme of forming words. The majority of words are therefore generated from basic entities called roots or radicals according to a predefined list of patterns called morphological balances or patterns [1] [4] [3]. The roots are constructed mainly from 3 letters, although 4-letter and 5-letter roots exist too. The morphological patterns represent the major spelling rules of Arabic words. This mechanism of Arabic word generation is called ‘AL-ISHTIQAQ’. It is performed by adding letters and/or diacritical marks to the roots. These additional letters and diacritical marks may be added at the beginning, in the middle or at the end of the root. In this paper, a morphological pattern is represented by the additional parts, their positions and the slots where the letters of a root can be inserted. The character ‘*’ represents the slots of the root’s letters. Figure 1 contains examples that illustrate the AL-ISHTIQAQ mechanism; it presents words generated from the same root ‘K T B’ according to different morphological patterns. It is important to note the role that diacritics play in fixing the meaning of the first and second words of Figure 1.

Example of words generated from the same root ‘K T B’ ( ك ت ب )

Generated word | Meaning in English | Morphological pattern used for building the word
كَتَبَ | (He) wrote | *َ *َ *َ
كُتِبَ | (It is) written | *ُ *ِ *َ
كتاب | Book | * * ا *
كاتب | Writer | * ا * *
يكتبون | (They are) writing | ي * * * و ن

Figure 1. AL-ISHTIQAQ Mechanism

All classes of words (verbs, nouns, adjectives and adverbs) can be generated from roots according to the appropriate patterns. The pattern used for generating a word determines its various attributes such as gender (masculine/feminine), number (singular/plural), tense (past, present, and imperative), mode, etc. Figure 2 presents an example that shows the importance of the standard Arabic morphological patterns in fixing the meaning of a word.

Based on the above, an Arabic word can be represented lexically by its root along with its morphological pattern. The latter belongs to a finite set of limited size. A pattern is defined by a set of additive letters and/or a set of diacritical marks and their positions in the generated word.

The word ( لاعبون ) is generated from the root ‘play’ ( ل ع ب ) according to the pattern ( فاعلون ). This pattern indicates that the word is a noun, its gender is masculine, and it is plural. The final meaning is ‘players’ (play: noun; plural; masculine).

Figure 2. Role of the Morphological Pattern of an Arabic Word in Fixing its Meaning

3. Arabic Language Features and Challenges

The formation of Arabic words presents specific features and challenges that must be taken into consideration when fixing the rules used by the morphological analyzer.

The first challenge is that some letters of the root may be dropped or modified during the generation of words from roots. The analyzer has to rebuild the original root letters by retrieving the missing or modified letters of the word.

The second challenge is the presence of eight different types of diacritical marks, used primarily to represent short vowels. In written text they are treated as special letters, each assigned a single code, as with normal letters. In fully diacriticized text a diacritical mark is added after each consonant of the word. These diacritical marks play a very important role in fixing the meaning of words. In fact, two different patterns may have the same sequence of consonants, with one distinguished from the other solely by the diacritical marks. These marks are classified into the following categories [1]:

• Three diacritical marks that indicate the short vowels ( َ ُ ِ ),
• Double diacritical marks that combine the single ones ( ً ٌ ٍ ),
• A single diacritical mark that indicates the absence of vowelization ( ْ ),
• A single diacritical mark that indicates the duplicate occurrence of a consonant ( ّ ).

According to the extent to which diacritics are used, Arabic text may be classified into three categories: undiacriticized, partially diacriticized, and fully diacriticized text. The first category is text without diacritics, such as typed or printed text and newspapers. The second category is partially diacriticized text, where diacritical marks are added to eliminate the ambiguities of some words. The last category is fully diacriticized Arabic text, in which every consonant is followed by a diacritical mark. This format is used for writing the Holy Koran, classic Arabic literature and children’s educational books.

The third challenge is that not all the words in Arabic text are generated from a root. For example, some words such as the tools and foreign words cannot be broken down into a root and a pattern. As the number of tools is limited, a table of these predefined tools can be used to check whether a word is a tool before sending it to the analyzer. The ‘loan’ or foreign words, meanwhile, are listed in the lexicon and need not undergo morphological analysis.

4. The Morphological Analyzer

The Morphological Analyzer of Arabic Words (MAAW) processes each word of the input text in order to determine its root and pattern. The results of the morphological analyzer can be used for further analysis. Figure 3 presents these transformations schematically.

Original Text → Morphological Analyzer (MAAW) → Morphological Features → Further Analysis (NLP Applications)

Figure 3. Morphological Analyzer

The identification of the morphological structure of a word depends on a rule-based system that can find the morphological pattern of diacriticized or undiacriticized words. To achieve this, we assume that a diacritic follows each letter of the word. If a diacritic is omitted, it is replaced by a special character (EXTRA-SECOUN) that we introduce to stand for the absent diacritic. This character is denoted by a dot in the examples of this paper. A procedure ‘Check_Diacritics’ takes the list of characters forming the word and checks for the presence of a diacritic after each consonant. It marks the absence of a diacritical mark after a consonant with the EXTRA-SECOUN mark. A word is then represented by a list of characters L in the following format:

[C1 V1 C2 V2 . . . Cn Vn]

where Ci is a consonant and Vi is a diacritical mark. Each of the classical patterns is also represented by a list of the same structure, in which the slots of the root’s letters are marked by the character ‘*’. Figure 4 shows an example of a classical pattern representation.

The pattern: “ يَفْعَلُوْنَ ”
Its corresponding list: [ ي َ * ْ * َ * ُ و ْ ن َ ]

Figure 4. Pattern Representation

To deal with the three possible situations of Arabic text (fully diacriticized, partially diacriticized and undiacriticized text), the list L is further divided into two new lists. The first list, LC, contains the sequence of consonants [C1, C2, . . . Cn] and the second list, LV, contains the diacritical characters [V1, V2, . . . Vn]. Table 1 shows examples of the segmentation of words into consonants and diacritics. The three examples given in Table 1 share the same list of consonants LC.

Word | Word Class | List of Consonants | List of Diacritics
يَذْهَبُوْنَ | Fully diacriticized | [ ي ذ هـ ب و ن ] | [ َ ْ َ ُ ْ َ ]
يَذهبُونَ | Partially diacriticized | [ ي ذ هـ ب و ن ] | [ َ . . ُ . َ ]
يذهبون | Undiacriticized | [ ي ذ هـ ب و ن ] | [ . . . . . . ]

Table 1. Decomposition of Words into a List of Consonants and a List of Diacritics

The list of consonants (LC) contains the letters of the word’s root together with the prefixes, infixes and suffixes used to form the word according to a given pattern. In order to extract the root of a word, the list LC can be represented by the following general description:

[X1[X2[X3]]] R1 [Y1] R2 [Y2] R3 [ [Y3] R4 [[Y4] R5]] [Z1[Z2[Z3]] ]

where the components X1X2X3 represent a prefix of at most three letters, the components Z1Z2Z3 represent a postfix of at most three letters, and the components Y1Y2Y3Y4 represent the possible infixes, at most four letters in total. The slots R1, R2, R3, R4 and R5 represent the letters of the root used to generate the word. The brackets [ ] indicate that the enclosed component is optional. This representation allows us to handle all kinds of roots (3-letter, 4-letter and 5-letter roots). Table 2 gives examples of the above representation. The first two words are generated from two different three-letter roots according to the same morphological pattern; they share the same additive parts (prefix, infix and postfix). The last three words are generated from the same root according to different patterns.

Input Word | List of Consonants | Root R1R2R3 | Prefix X1X2X3 | Infix Y1Y2 | Postfix Z1Z2Z3
سيذهبون | [ س ي ذ هـ ب و ن ] | [ ذ هـ ب ] | [ س ي ] | [ ] | [ و ن ]
سيدرسون | [ س ي د ر س و ن ] | [ د ر س ] | [ س ي ] | [ ] | [ و ن ]
دارسون | [ د ا ر س و ن ] | [ د ر س ] | [ ] | [ ا ] | [ و ن ]
مدرسون | [ م د ر س و ن ] | [ د ر س ] | [ م ] | [ ا ي ] | [ ]

Table 2. Decomposition of the List of Consonants

The morphological patterns are also segmented into two lists, LC and LV. For example, the pattern presented above in Figure 4 can be broken down into a list of consonants (LC) and a list of diacritical marks (LV) [Figure 5]. The separation of consonants and diacritics significantly reduces the number of patterns to be tested.

The pattern: “ يَفْعَلُوْنَ ”
List of consonants (including the slots for the root): [ ي * * * و ن ]
List of diacritical marks: [ َ ْ َ ُ ْ َ ]

Figure 5. Decomposition of Patterns Into a List of Consonants and a List of Diacritics

The LC list of a given pattern is represented by the following structure:

[ X1 [ X2 [ X3 ] ] ] * [ Y1 ] * [ Y2 ] * [ [Y3] * [[Y4] * ] ] [ Z1 [ Z2 [ Z3 ] ] ]

The characters ‘*’ represent slots where consonants can be inserted to form a real word. Table 3 shows examples of the representation of morphological patterns using this schema. A comparison of Tables 2 and 3 shows that the words of Table 2 share the same prefix, infix and postfix parts with the patterns of Table 3, which means that the words of Table 2 are generated according to the corresponding patterns of Table 3.

Morphological patterns can be regrouped into classes according to their list of consonants. Patterns of the same class share the same list of consonants and differ from one another in their list of diacritical marks. Table 4 shows an example of three different patterns of the same class; these patterns have the same list LC and different lists LV. The set of patterns is represented by the set of consonant lists LC, where we associate with each entry all the possible and correct combinations of diacritical marks LV. The pair LC and LV determines the morphology of the word.

5. Components of Morphological Analyzer

The main components of the proposed Morphological Analyzer (MAAW) are shown in Figure 8. It has three analytical components: the ‘rules’ component, the ‘lexical’ component, and the ‘patterns’ component.

First, the rules component consists of an engine containing the rules used to extract diacritics and the rules used to extract the patterns and roots. Second, the patterns component lists the standard patterns, where we associate with each entry all the possible and acceptable configurations of diacritical marks; the number of configurations per entry is limited to a maximum of 5. Third, the lexicon has a classical form and lists the roots of the Arabic language and, for each root, the possible patterns that can be applied to generate words from it. The lexicon is used to verify the correctness of the analysis performed by the other components of MAAW. If the correctness of the word is verified, the extracted root, pattern and list of diacritics are used by the lexicon to identify its possible meanings.
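One way to picture the patterns component and the lexicon described above is the following Python sketch. It is only an illustration under stated assumptions: the names `PATTERNS`, `LEXICON`, `compatible`, `candidate_configs` and `is_valid` are hypothetical, the second diacritic configuration is an invented example entry, and the paper does not specify how MAAW actually stores these tables. The dot plays the role of the EXTRA-SECOUN mark.

```python
# Illustrative sketch only: every name and table entry here is an assumption,
# not the data structures actually used by MAAW.
FATHA, DAMMA, SUKUN = "\u064E", "\u064F", "\u0652"
EXTRA_SECOUN = "."  # stands for a diacritic that is missing from the input
SLOT = "*"          # slot for a root letter inside a pattern

# Pattern class of Figure 9: the consonant list [ي * * * و ن].
YFALUN = ("ي", SLOT, SLOT, SLOT, "و", "ن")

# Patterns component: each consonant list (PLC) maps to its accepted
# diacritic configurations (the paper allows at most 5 per entry).
PATTERNS = {
    YFALUN: [
        (FATHA, SUKUN, FATHA, DAMMA, SUKUN, FATHA),  # configuration of Figure 9
        (DAMMA, SUKUN, FATHA, DAMMA, SUKUN, FATHA),  # hypothetical second entry
    ],
}

# Lexicon: each root maps to a gloss and the pattern classes it accepts.
LEXICON = {
    ("ل", "ع", "ب"): {"gloss": "play", "patterns": {YFALUN}},
}


def compatible(observed_lv, config):
    """True if every diacritic present in the word agrees with config."""
    return len(observed_lv) == len(config) and all(
        o == EXTRA_SECOUN or o == c for o, c in zip(observed_lv, config)
    )


def candidate_configs(plc, observed_lv):
    """Return the stored diacritic configurations the word could carry."""
    return [cfg for cfg in PATTERNS.get(plc, []) if compatible(observed_lv, cfg)]


def is_valid(root, plc):
    """Accept a word if its root is in the lexicon and allows this pattern."""
    entry = LEXICON.get(root)
    return entry is not None and plc in entry["patterns"]
```

With an undiacriticized word every stored configuration of the class remains a candidate, whereas a single observed diacritic can already rule some of them out; this is the ambiguity that the lexicon and later syntactic analysis would have to resolve.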
Pattern | The List PLC | Prefix X1X2X3 | Infix Y1Y2 | Postfix Z1Z2Z3
سيفعلون | [ س ي * * * و ن ] | [ س ي ] | [ ] | [ و ن ]
فاعلون | [ * ا * * و ن ] | [ ] | [ ا ] | [ و ن ]
مفاعيل | [ م * ا * ي * ] | [ م ] | [ ا ي ] | [ ]

Table 3. Examples of Pattern Representation

Pattern | List of Consonants LC | List of Diacritical Marks LV
يـ ْفـعـلُ ْـو ْن | [ ي * * * و ن ] | [ ْ َْ ْ َ ]
يُـ ْفـعـلُ ْـو ْن | [ ي * * * و ن ] | [ ْ َْ ُ َْ َُ ]
يـ ْفـ ّعـلً ْـو ْن | [ ي * * * و ن ] | [ ْ َْ َُ ّ َْ َُ ]

Table 4. Grouping Patterns According to Their List of Consonants

The recursive rule ‘Decompose’ decomposes the word into two lists: one for consonants, LC, and one for diacritics, LV. Decompose calls another rule, ‘In’, which returns TRUE only if the character H is one of the diacritical marks of the Arabic language. Decompose marks the absence of a diacritic by adding the EXTRA-SECOUN mark (a dot). The rule can be described by the following Prolog-style code [Figure 6].

The identification of the root and pattern is carried out by a recursive procedure ‘Match’ that takes the list LC of the input word and returns the list PLC of consonants of the pattern and the root ROOT. The Prolog-style description of the rule Match is presented in Figure 7. The root of the word is identified by applying the rules that relate the pattern to the slots of the root letters. The rule ‘FindSlots’ returns a list of integers giving the positions of the letters of the root in the given pattern. It is necessary to use two separate rules to identify the root, to accommodate cases where one of the letters of a root is dropped or changed.

The lexical component of the system receives the results of the preceding analysis: the root, the list of consonants of the pattern PLC and the list of diacritics of the word LV. The lists LV and PLC determine the pattern of the analyzed word. The lexical component then verifies the correctness of the word.
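The two steps described above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions rather than the paper's rules: the function names `decompose`, `find_slots`, `fits` and `match` are hypothetical, the loops are iterative instead of recursive, the list `PATTERN_CLASSES` holds only a few consonant lists taken from Tables 3 and 4, and the case where a root letter is dropped or modified is not handled.

```python
# Minimal sketch of the decomposition and matching steps; names are
# illustrative and dropped or modified root letters are not handled.
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
FATHATAN, DAMMATAN, KASRATAN = "\u064B", "\u064C", "\u064D"
SHADDA, SUKUN = "\u0651", "\u0652"
DIACRITICS = {FATHA, DAMMA, KASRA, FATHATAN, DAMMATAN, KASRATAN, SHADDA, SUKUN}
EXTRA_SECOUN = "."  # marks a consonant that carries no diacritic
SLOT = "*"          # slot for a root letter inside a pattern

# A few consonant lists (PLC) taken from Tables 3 and 4.
PATTERN_CLASSES = [
    ["ي", SLOT, SLOT, SLOT, "و", "ن"],
    [SLOT, "ا", SLOT, SLOT, "و", "ن"],
    ["س", "ي", SLOT, SLOT, SLOT, "و", "ن"],
]


def decompose(word):
    """Split a word into its consonant list LC and diacritic list LV."""
    lc, lv = [], []
    pending = False  # True while the last consonant still lacks a diacritic
    for ch in word:
        if ch in DIACRITICS:
            lv.append(ch)  # a second mark on the same consonant is simply appended
            pending = False
        else:
            if pending:  # two consecutive consonants: fill the gap with a dot
                lv.append(EXTRA_SECOUN)
            lc.append(ch)
            pending = True
    if pending:  # the last consonant is not followed by a diacritic
        lv.append(EXTRA_SECOUN)
    return lc, lv


def find_slots(plc):
    """Return the 1-based positions of the root slots in a pattern."""
    return [i + 1 for i, c in enumerate(plc) if c == SLOT]


def fits(lc, plc):
    """True if every non-slot letter of the pattern equals the word's letter."""
    return len(lc) == len(plc) and all(p == SLOT or p == c for c, p in zip(lc, plc))


def match(lc, pattern_classes=PATTERN_CLASSES):
    """Return (PLC, slot positions, root) for every pattern class that fits LC."""
    results = []
    for plc in pattern_classes:
        if fits(lc, plc):
            slots = find_slots(plc)
            results.append((plc, slots, [lc[i - 1] for i in slots]))
    return results
```

Calling `decompose` on a word and passing the resulting consonant list to `match` yields, for each fitting pattern class, the slot positions and the letters found in those slots. For a six-letter word of the class [ ي * * * و ن ], the positions come out as [2, 3, 4].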
An Arabic word is lexically correct if its root is an entry of the lexicon and its pattern is among the acceptable patterns of this root. If the word is correct, its meaning or meanings are provided by the lexicon.

Decompose (word, LC, LV, Flag):
    Decompose ([ ], _ , _ , False).               // Base case: the decomposition is terminated
    Decompose ([ ], _ , T2, True):                // The last consonant is not
        Decompose ([ ], _ , [‘.’ | T2], False).   // followed by a diacritic
    Decompose ([H|T], _ , T2, _ ):
        In (H, DiacriticsList),                   // Detection of a diacritic
        Decompose (T, _ , [H|T2], False);
    Decompose ([H|T], T1, _ , False):             // Detection of a consonant at
        Decompose (T, [H|T1], _ , True);          // the first call of the rule or
                                                  // after detection of a diacritic
    Decompose ([H|T], _ , T2, True):              // Detection of 2
        Decompose ([H|T], _ , [‘.’ | T2], False). // consecutive consonants

Figure 6. The Recursive Rule Decompose

Match (LC, PLC, ROOT):
    FindPattern (LC, PLC),
    FindSlots (PLC, [List_of_slots]),
    FindRoot (LC, [List_of_slots], ROOT, 1);

FindPattern (LC, PLC):
    FindPattern ([ ], [ ]).
    FindPattern ([Head|Tail1], [‘*’|Tail2]):
        FindPattern (Tail1, Tail2).
    FindPattern ([Head|Tail1], [Head|Tail2]):
        FindPattern (Tail1, Tail2).

FindRoot (LC, [List_of_slots], ROOT, Pos):
    FindRoot (LC, [ ], [ ], _ ).
    FindRoot ([Head|Tail1], [H|T], [Head|Tail2], X):
        X = H,
        FindRoot (Tail1, T, Tail2, X+1).
    FindRoot ([Head|Tail1], L1, L2, X):
        FindRoot (Tail1, L1, L2, X+1).

Figure 7. The Recursive Rule Match

Input word ( معلِماْت )
→ Decomposition Rules → List LC ( م ع ل م ا ت ), List LV ( . . ِ . ْ . )
→ Matching Rules → Root ( ع ل م ), Pattern ( م * * * ا ت )
→ LEXICON → RESULTS
Valid word? Yes
Meaning: Root ( ع ل م ) → Teach; Pattern ( م * َ * ِ * َ ا ْ ت ٌ ) → (noun, plural, feminine, subject)
Final meaning: Teachers (feminine)

Figure 8. Components of MAAW

6. Experiments

The Morphological Analyzer proposed in this paper can be applied in different ways. It can be used as an independent system or as part of an NLP application. The output of the Analyzer determines whether the word is generated from a root or not. If it is generated from a root according to a pattern, this information may be sent to a lexicon to extract the meaning or meanings of the word. The richness of the lexicon determines the ultimate performance of the complete system. Figure 9 gives a complete example of the application of the rules of MAAW.

The input word: يَلْعَبُوْنَ
List of characters: [ ي َ ل ْ ع َ ب ُ و ْ ن َ ]
List of consonants: [ ي ل ع ب و ن ]
List of diacritical marks: [ َ ْ َ ُ ْ َ ]
The pattern that will be detected: [ ي * * * و ن ]
Positions of the root letters: [2, 3, 4]
The root will be: [ ل ع ب ]
The Lexicon output: ROOT → Play; PATTERN → Verb, Present, Plural, Masculine

Figure 9. Complete Example of a Word Analyzed by MAAW

The proposed Analyzer gives accurate results for fully diacriticized Arabic text. When diacritical marks are absent, it produces a list of consonants that may correspond to more than one morphological pattern. In these cases, determining the definitive list of diacritics that replaces the missing ones requires advanced syntactical analysis.

7. Conclusion

The Morphological Analyzer of Arabic words presented in this paper aims to prepare Arabic text for natural language processing applications. It analyzes Arabic words in order to extract their morphological structures. This task poses many problems related to the specific features of the Arabic language, such as the presence of diacritics and the elimination of some letters in the generation of words. In order to solve these problems, we introduced a rule-based system that takes a word and determines its morphology. The system has three components: the rules, the lexicon and the morphological patterns of the language. The morphological analyzer proceeds by matching to determine the root and the morphological pattern of the word. In the case of missing diacritics, more than one pattern may be a candidate for the final output. Future work will focus on extending the Morphological Analyzer to use syntactical information in order to determine the definitive morphological pattern used to build the word when diacritical marks are absent.

References

[1] N. Ali, N. Hegazi and E. Abed, “A Morphology Based Data Compression Technique For Arabic Text”, Computer Communications–AFRICOM84, pp. 241-251, 1984.

[2] E. L. Antworth, “Morphological Parsing with a Unification-Based Word Grammar”, North Texas Natural Language Processing Workshop, University of Texas at Arlington, http://www.sil.org/pckimmo/ntnlp94.html, May 1994.

[3] A. Awajan, “Low-Level NLP Technique for Arabic Text Processing”, The Proceedings of the ISCA 16th International Conference on Computers and Their Applications, Seattle, Washington USA, pp. 287-289, March 2001.

[4] K. R. Beesley, “Arabic Finite-State Morphological Analysis And Generation”, The 16th International Conference on Computational Linguistics, Proceedings, Vol. 1, pp. 89-94, August 1996.

[5] K. R. Beesley and L. Karttunen, “Finite-State Non-Concatenative Morphotactics”, SIGPHON-2000, Proceedings of the 5th Workshop of the ACL Special Interest Group in Computational Phonology, Luxembourg, pp. 1-12, August 2000.

[6] L. Breidt and F. Segond, “IDAREX: Formal Description of German and French Multi-Word Expression with Finite State Technology”, MLTT-022, November 1995.

[7] D. Carter, “Rapid development of Morphological Descriptions for Full Language Processing Systems”, http://www.cam.sri.com/tr/crc047paper/paper.html, 1997.

[8] J. P. Chanod, “Finite State Composition of French Verb Morphology”, MLTT Technical Reports, http://www.rxrc.xerox.com/publis/mltt/mltttech.html, November 1994.

[9] G. Grefenstette and P. Tapanainen, “What is a word, What is a sentence, Problems of Tokenization”, The 3rd Conference on Computational Lexicography and Text Research, COMPLEX ’94, Budapest, July 1994.
