A Rule-Based Morphological Analyzer of Arabic Words
ARAFAT AWAJAN
Computer Science Department
Princess Sumaya University College for Technology
Royal Scientific Society
Amman - JORDAN
Abstract: - This paper describes a rule-based technique for analyzing the morphology of
Arabic words. The proposed ‘Morphological Analyzer’ processes the input word in order to
determine its lexical form. The lexical form of the majority of Arabic words consists of a root
and a morphological pattern. The analyzer applies a set of predefined rules in order to analyze
the morphology of Arabic words as they appear in real text. It is able to recognize
diacriticized, undiacriticized or partially diacriticized Arabic words generated from N-letter
roots. In order to determine the possible meanings of a word, the Morphological Analyzer
also provides some useful attributes of the word such as its type, gender, tense and number.
The proposed Morphological Analyzer is a general-purpose technique that can be integrated
into larger scale systems such as automatic translation applications, text summarization
applications, text correction applications, web search engines, automatic vowelization of
Arabic text applications and other natural language processing applications.
Keywords: - Natural Language Processing, Arabic Word Recognition, Lexical Form, Roots,
Morphological Patterns, Morphological Analyzer.
1. Introduction
For the Arabic language, as for many other languages, the morphological
features of a word provide crucial information for text understanding and
information extraction. In fact, the possible meanings of an individual word
depend mainly on its morphology and its position in the sentence. Therefore,
the possible meanings of each word must be determined first in order to
understand text written in a natural language.

A number of research papers concerning the morphological analysis of words
have been published for various natural languages, particularly the European
languages [2] [5] [8] [6] [9]. Descriptions of real systems for analyzing the
morphology of these languages are also available [7]. These works show that
the complexity of the morphological analysis of words varies from one natural
language to another. Fewer research papers have been published on the
morphological analysis of Arabic words. Most of these published works ignore
the presence of diacritics in the Arabic text or limit the analysis to words
generated from 3-letter roots [1] [4] [3].

The rule-based Morphological Analyzer presented in this paper has the
objective of finding the lexical form and the possible meanings of each word
in a text written in Arabic. The proposed analyzer is being developed in
order to analyze Arabic words as they appear in real text. It can be applied
to diacriticized, undiacriticized or partially diacriticized Arabic words.
Furthermore, it allows the morphological analysis of words generated from
roots of variable length.

Our approach is based on the specific features and structures that the Arabic
language uses for generating words. It applies a set of predefined rules
specific to the Arabic language in order to extract the lexical structure of
the word, which generally consists of a root and a morphological pattern. A
classical lexicon is used to verify the correctness of the analyzed word and
to determine the meanings it could take.

The morphological information that our technique is able to extract gives
vital support to the different fields and applications of Natural Language
Processing. The purpose of the Morphological Analyzer is to preprocess Arabic
text in order to prepare it for automated treatment such as human-machine
interaction, translation, text summarization, text correction, automatic
vowelization of Arabic text, web search engines and other applications of
natural language processing.
2. The Morphology of Arabic Words
In many European languages, words are constructed from basic units called
morphemes by adding prefixes and suffixes. A morpheme is the primitive unit
of meaning in a language. For example, the meaning of the English word
‘friendly’ is derivable from the meaning of the noun ‘friend’ and the suffix
‘-ly’, which transforms a noun into an adjective [3]. In such cases the
morphological analysis is based on the elimination of affixes and the
extraction of the basic morpheme of a word. Special treatment is always
required to deal with the irregularities present in almost all natural
languages.

The morphology of the Arabic language is based on the Semitic
root-and-pattern scheme of forming words. The majority of words are therefore
generated from basic entities called roots or radicals according to a
predefined list of patterns called morphological balances or patterns
[1] [4] [3]. Most roots consist of 3 letters, although 4-letter and 5-letter
roots also exist. The morphological patterns represent the major spelling
rules of Arabic words. This mechanism of Arabic word generation is called
‘AL-ISHTIQAQ’. It is performed by adding letters and/or diacritical marks to
the root. These additional letters and diacritical marks may be added at the
beginning, in the middle or at the end of the root. In this paper, a
morphological pattern is represented by the additional parts, their positions
and the slots where the letters of a root can be inserted. The character ‘*’
represents a slot for a root letter. Figure 1 illustrates the AL-ISHTIQAQ
mechanism: it presents words generated from the same root “K T B” according
to different morphological patterns. It is important to note the role that
diacritics play in fixing the meaning of the first and second words of
Figure 1.
Example of words generated from the same root “K T B” ( ك ت ب )

Meaning in English | Generated word | Morphological pattern used for building the word
(He) wrote | كَتَبَ | *َ *َ *َ
(It is) written | كُتِبَ | *ُ *ِ *َ
Book | كتاب | * * ا *
Writer | كاتب | * ا * *
(They are) writing | يكتبون | ي * * * و ن

Figure 1. AL-ISHTIQAQ Mechanism
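
The slot notation of Figure 1 can be illustrated with a minimal Python sketch that fills the ‘*’ slots of a pattern with the letters of a root. The helper name apply_pattern is an assumption made for this example and is not part of MAAW; diacritics are left out to keep the sketch short.

```python
# Illustrative sketch only: apply_pattern is a hypothetical helper, not MAAW code.
KAF, TA, BA = "\u0643", "\u062a", "\u0628"      # root letters K, T, B
ALEF, YA, WAW, NUN = "\u0627", "\u064a", "\u0648", "\u0646"  # additive letters

def apply_pattern(root, pattern, slot="*"):
    """Fill each slot of `pattern` with the next letter of `root`."""
    if pattern.count(slot) != len(root):
        raise ValueError("the pattern needs exactly one slot per root letter")
    letters = iter(root)
    return "".join(next(letters) if ch == slot else ch for ch in pattern)

ROOT = KAF + TA + BA
WRITER = apply_pattern(ROOT, "*" + ALEF + "**")          # pattern * A * *
BOOK = apply_pattern(ROOT, "**" + ALEF + "*")            # pattern * * A *
WRITING = apply_pattern(ROOT, YA + "***" + WAW + NUN)    # pattern Y * * * W N
print(WRITER, BOOK, WRITING)
```

The same root yields three different words only because the additive letters sit at different positions, which is the information a pattern carries.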
All classes of words (verbs, nouns, adjectives and adverbs) can be generated
from roots according to the appropriate patterns. The pattern used for
generating a word determines its various attributes such as gender
(masculine/feminine), number (singular/plural), tense (past, present and
imperative), mood, etc. Figure 2 presents an example that shows the
importance of the standard Arabic morphological patterns in fixing the
meaning of a word.

Based on the above, an Arabic word can be represented lexically by its root
along with its morphological pattern. The latter is one element of a finite
set of limited size. A pattern is defined by a set of additive letters and/or
a set of diacritical marks and their positions in the generated word.
3. Arabic Language Features and Challenges
The formation of Arabic words presents specific features and challenges that
must be taken into consideration when defining the rules used by the
morphological analyzer. The first challenge is that some letters of the root
may be dropped or modified during the generation of words from roots. The
analyzer has to rebuild the original root letters by retrieving the missing
or modified letters of the word.

The second challenge is the presence of eight different types of diacritical
marks, used mainly to represent short vowels. In written text they are
treated as special letters, each one assigned a single code, as with normal
letters. In fully diacriticized text a diacritical mark is added after each
consonant of the word. These diacritical marks play a very important role in
fixing the meaning of words. In fact, two different patterns may have the
same sequence of consonants, with one distinguished from the other solely by
the diacritical marks. These marks are classified into the following
categories [1]:
• three diacritical marks that indicate the short vowels ( ـَ ـُ ـِ ),
• three double diacritical marks that combine the single ones ( ـً ـٌ ـٍ ),
• a single diacritical mark that indicates the absence of vowelization ( ـْ ),
• a single diacritical mark that indicates the duplicate occurrence of a
  consonant ( ـّ ).

According to the extent to which diacritics are used, Arabic text may be
classified into three categories: undiacriticized, partially diacriticized
and fully diacriticized text. The first category represents text without
diacritics, such as typed or printed text and newspapers. The second category
represents partially diacriticized text, where diacritical marks are added to
eliminate the ambiguities of some words. The last category represents fully
diacriticized Arabic text, in which every consonant is followed by a
diacritical mark. This format is used for writing the Holy Koran, classic
Arabic literature and children’s educational books.

The third challenge is that not all the words in Arabic text are generated
from a root. For example, words such as the tools and foreign words cannot be
broken down into a root and a pattern. As the number of tools is limited, a
table of these predefined tools can be used to check whether a word is a tool
before sending it to the analyzer. The ‘loan’ or foreign words, meanwhile,
are listed in the lexicon and need not undergo morphological analysis.
The word ( لاعبون ) is generated from the root “play” ( لعب ) according to
the pattern ( فاعلون ). This pattern indicates that the word is a noun, that
its gender is masculine, and that it is plural.
The final meaning will be “players” (play: noun; plural; masculine).
Figure 2. Role of the Morphological Pattern of an Arabic Word in Fixing its
Meaning
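
The three categories of Arabic text described in Section 3 can be told apart mechanically. The following Python sketch is one possible way to do so, under two assumptions that are not taken from the paper: the eight diacritical marks are the Unicode code points U+064B to U+0652, and a word counts as fully diacriticized only when every letter is followed by a mark. The name diacritization_level is hypothetical.

```python
# Sketch under stated assumptions; diacritization_level is a hypothetical helper.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}  # eight marks, U+064B..U+0652

def diacritization_level(word):
    """Classify a word by how many of its letters carry a diacritical mark."""
    letters = 0
    marked = 0
    previous_is_letter = False
    for ch in word:
        if ch in DIACRITICS:
            if previous_is_letter:      # count at most one mark per letter
                marked += 1
            previous_is_letter = False
        elif ch.isalpha():
            letters += 1
            previous_is_letter = True
        else:
            previous_is_letter = False
    if marked == 0:
        return "undiacriticized"
    if marked == letters:
        return "fully diacriticized"
    return "partially diacriticized"

FATHA = "\u064e"
KAF, TA, BA = "\u0643", "\u062a", "\u0628"
print(diacritization_level(KAF + FATHA + TA + FATHA + BA + FATHA))
print(diacritization_level(KAF + FATHA + TA + BA))
print(diacritization_level(KAF + TA + BA))
```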
4. The Morphological Analyzer
The Morphological Analyzer of Arabic Words (MAAW) processes each word of the
input text in order to determine its root and pattern. The results of the
morphological analyzer can be used for further analysis. Figure 3 presents
these transformations schematically.

The identification of the morphological structure of a word depends on a
rule-based system that can find the morphological pattern for diacriticized
or undiacriticized words. To achieve this, we assume that a diacritic follows
each letter of the word. If a diacritic is omitted, it is replaced by a
special character (EXTRA-SECOUN) that we introduce to stand for the absent
diacritic. EXTRA-SECOUN is denoted by a dot in the examples of this paper.

A procedure ‘Check_Diacritics’ takes the list of characters forming the word
and checks for the presence of a diacritic after each consonant. Where a
consonant is not followed by a diacritical mark, it inserts the mark
EXTRA-SECOUN. A word is then represented by a list of characters L in the
following format:
[C1 V1 C2 V2 . . . Cn Vn]
where Ci is a consonant and Vi is a diacritical mark. Each of the classical
patterns is also represented by a list of the same structure, where the slots
for the root’s letters are marked by the character ‘*’. Figure 4 shows an
example of a classical pattern representation.

To deal with the three possible situations of Arabic text (fully
diacriticized, partially diacriticized and undiacriticized text), the list L
is further divided into two new lists. The first list, LC, contains the
sequence of consonants [C1, C2, . . . Cn] and the second list, LV, contains
the diacritical characters [V1, V2, . . . Vn]. Table 1 shows examples of the
segmentation of words into consonants and diacritics. The three examples
given in Table 1 share the same list of consonants LC.
Original Text → Morphological Analyzer (MAAW) → Morphological Features → Further Analysis (NLP Applications)
Figure 3. Morphological Analyzer
The pattern: “ يَفْعَلُوْنْ ”
Its corresponding list: [ ي ـَ * ـْ * ـَ * ـُ و ـْ ن ـْ ]
Figure 4. Pattern Representation
Word class | Word | List of consonants (LC) | List of diacritics (LV)
Fully diacriticized | يَذْهَبُوْنْ | [ ي ذ هـ ب و ن ] | [ ـَ ـْ ـَ ـُ ـْ ـْ ]
Partially diacriticized | يَذهبُونْ | [ ي ذ هـ ب و ن ] | [ ـَ . . ـُ . ـْ ]
Undiacriticized | يذهبون | [ ي ذ هـ ب و ن ] | [ . . . . . . ]

Table 1. Decomposition of Words into a List of Consonants and a List of
Diacritics
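
The decomposition shown in Table 1 can be sketched in Python as follows. This is a simplified illustration, not the rule used by MAAW: it assumes the diacritics are the Unicode marks U+064B to U+0652, it skips the elongation character (tatweel), and it keeps two marks on the same consonant (for example a shadda followed by a vowel) together in one LV entry so that LC and LV stay aligned. The function name decompose and these conventions are assumptions made for the example.

```python
# Sketch under stated assumptions; not the MAAW implementation.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}  # U+064B..U+0652
TATWEEL = "\u0640"        # elongation character, carries no information
EXTRA_SECOUN = "."        # placeholder for a missing diacritic

def decompose(word):
    """Split a word into aligned lists LC (consonants) and LV (diacritics)."""
    lc, lv = [], []
    for ch in word:
        if ch == TATWEEL:
            continue
        if ch in DIACRITICS:
            if not lc:                      # stray mark before any consonant
                continue
            if lv[-1] == EXTRA_SECOUN:
                lv[-1] = ch                 # first mark on this consonant
            else:
                lv[-1] += ch                # second mark on the same consonant
        else:
            lc.append(ch)
            lv.append(EXTRA_SECOUN)         # assume "no diacritic" until one is read
    return lc, lv

FATHA, DAMMA, SUKUN = "\u064e", "\u064f", "\u0652"
YA, DHAL, HA, BA, WAW, NUN = "\u064a", "\u0630", "\u0647", "\u0628", "\u0648", "\u0646"
FULL = YA + FATHA + DHAL + SUKUN + HA + FATHA + BA + DAMMA + WAW + SUKUN + NUN + SUKUN
BARE = YA + DHAL + HA + BA + WAW + NUN
print(decompose(FULL))
print(decompose(BARE))
```

Both calls return the same list of consonants, which is why the consonant list alone can be used to select the candidate patterns.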
The list of consonants (LC) contains the letters of the word’s root together
with the prefixes, infixes and suffixes used to form the word according to a
given pattern. In order to extract the root of a word, the list LC can be
represented by the following general description:
[X1[X2[X3]]] R1 [Y1] R2 [Y2] R3 [ [Y3] R4 [ [Y4] R5 ] ] [Z1[Z2[Z3]]]
where the components X1X2X3 represent a prefix of at most three letters, the
components Z1Z2Z3 represent a postfix of at most three letters, and the
components Y1Y2Y3Y4 represent the possible infixes, of at most four letters.
The slots R1, R2, R3, R4 and R5 represent the letters of the root used to
generate the word. The characters [ ] indicate that the enclosed component is
optional. This representation allows us to handle all kinds of roots
(3-letter, 4-letter and 5-letter roots). Table 2 gives examples of the above
representation. The first two words are generated from two different
three-letter roots according to the same morphological pattern; they share
the same additive parts (prefix, infix and postfix). The last three words are
generated from the same root according to different patterns.

The morphological patterns are also segmented into two lists, LC and LV. For
example, the pattern presented above in Figure 4 can be broken down into a
list of consonants (LC) and a list of diacritical marks (LV) (Figure 5). The
separation of consonants and diacritics significantly reduces the number of
patterns to be tested.
Input word | List of consonants | Root R1R2R3 | Prefix X1X2X3 | Infix Y1Y2 | Postfix Z1Z2Z3
سيذهبون | [ س ي ذ هـ ب و ن ] | [ ذ هـ ب ] | [ س ي ] | [ ] | [ و ن ]
سيدرسون | [ س ي د ر س و ن ] | [ د ر س ] | [ س ي ] | [ ] | [ و ن ]
دارسون | [ د ا ر س و ن ] | [ د ر س ] | [ ] | [ ا ] | [ و ن ]
مدرسون | [ م د ر س و ن ] | [ د ر س ] | [ م ] | [ ا ي ] | [ ]

Table 2. Decomposition of the List of Consonants
The pattern: “ يَفْعَلُوْنْ ”
List of consonants (including the slots for the root): [ ي * * * و ن ]
List of diacritical marks: [ ـَ ـْ ـَ ـُ ـْ ـْ ]
Figure 5. Decomposition of Patterns into a List of Consonants and a List of
Diacritics
The LC list of a given pattern is represented by the following structure:
[X1[X2[X3]]] * [Y1] * [Y2] * [ [Y3] * [ [Y4] * ] ] [Z1[Z2[Z3]]]
The characters ‘*’ represent slots where consonants can be inserted to form a
real word. Table 3 shows examples of the representation of morphological
patterns using this schema. A comparison of Tables 2 and 3 shows that the
words of Table 2 share the same prefix, infix and postfix parts with the
patterns of Table 3, which means that the words of Table 2 are generated
according to the corresponding patterns of Table 3.

Morphological patterns can be grouped into classes according to their list of
consonants. Patterns of the same class share the same list of consonants and
differ from one another in their list of diacritical marks. Table 4 shows an
example of three different patterns of the same class; these patterns have
the same list LC and different lists LV. The set of patterns is represented
by the set of consonant lists LC, where we associate with each entry all the
possible and correct combinations of diacritical marks LV. The pair LC and LV
determines the morphology of the word.

5. Components of the Morphological Analyzer
The main components of the proposed Morphological Analyzer (MAAW) are shown
in Figure 8. It has three analytical components: the ‘rules’ component, the
‘lexical’ component and the ‘patterns’ component.

First, the rules component consists of an engine containing the rules used to
extract diacritics and the rules used to extract the patterns and roots.
Second, the patterns component lists the standard patterns; with each entry
we associate all the possible and acceptable configurations of diacritical
marks, and the number of configurations per entry is limited to a maximum of
5. Third, the lexicon has a classical form and lists the roots of the Arabic
language and, for each root, the possible patterns that can be applied to
generate words from that root. The lexicon is used to verify the correctness
of the analysis performed by the other components of MAAW. If the correctness
of the word is verified, the extracted root, pattern and list of diacritics
are used by the lexicon to identify its possible meanings.
Pattern | The list PLC | Prefix X1X2X3 | Infix Y1Y2 | Postfix Z1Z2Z3
سيفعلون | [ س ي * * * و ن ] | [ س ي ] | [ ] | [ و ن ]
فاعلون | [ * ا * * و ن ] | [ ] | [ ا ] | [ و ن ]
مفاعيل | [ م * ا * ي * ] | [ م ] | [ ا ي ] | [ ]

Table 3. Examples of Pattern Representation
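
The prefix, infix and postfix columns of Table 3 follow directly from the positions of the ‘*’ slots in a pattern’s consonant list. The Python sketch below shows one way to derive them and to check the limits stated for the general LC structure (3 to 5 root slots, at most three prefix letters, at most three postfix letters, at most one infix letter between two consecutive root letters). The function name split_pattern is an assumption made for this example.

```python
# Sketch only: split_pattern is a hypothetical helper, not part of MAAW.
def split_pattern(plc, slot="*"):
    """Return (prefix, infixes, postfix) of a pattern's consonant list PLC."""
    slots = [i for i, ch in enumerate(plc) if ch == slot]
    if not 3 <= len(slots) <= 5:
        raise ValueError("a pattern needs 3 to 5 root slots")
    prefix = plc[:slots[0]]
    postfix = plc[slots[-1] + 1:]
    # letters found between consecutive slots (Y1, Y2, ...); "" when absent
    infixes = [plc[a + 1:b] for a, b in zip(slots, slots[1:])]
    if len(prefix) > 3 or len(postfix) > 3 or any(len(y) > 1 for y in infixes):
        raise ValueError("pattern does not fit the general LC structure")
    return prefix, infixes, postfix

SIN, YA, WAW, NUN = "\u0633", "\u064a", "\u0648", "\u0646"
MIM, ALEF = "\u0645", "\u0627"
FIRST = split_pattern(SIN + YA + "***" + WAW + NUN)          # first row of Table 3
THIRD = split_pattern(MIM + "*" + ALEF + "*" + YA + "*")     # third row of Table 3
print(FIRST)
print(THIRD)
```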
Pattern | List of consonants (LC) | List of diacritical marks (LV)
يـ ْفـعـلُ ْـو ْن | [ ي * * * و ن ] | [ ْ َْ ْ َ ]
يُـ ْفـعـلُ ْـو ْن | [ ي * * * و ن ] | [ ْ َْ ُ َْ َُ ]
يـ ْفـ ّعـلً ْـو ْن | [ ي * * * و ن ] | [ ْ َْ َُ ّ َْ َُ ]

Table 4. Grouping Patterns According to Their List of Consonants
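
The grouping of patterns into classes can be pictured as a table keyed by the consonant list, where each entry holds the admissible lists of diacritics. The Python sketch below filters the diacritic lists of one class against the LV list of an input word, treating the EXTRA-SECOUN dot as "any diacritic". The two LV entries used here are illustrative assumptions, not data taken from the MAAW pattern table, and the names compatible and candidate_patterns are hypothetical.

```python
# Sketch with illustrative data; the entries below are assumptions, not MAAW data.
EXTRA_SECOUN = "."
FATHA, DAMMA, SUKUN = "\u064e", "\u064f", "\u0652"
YA, WAW, NUN = "\u064a", "\u0648", "\u0646"

ACTIVE = [FATHA, SUKUN, FATHA, DAMMA, SUKUN, SUKUN]    # hypothetical LV entry
PASSIVE = [DAMMA, SUKUN, FATHA, DAMMA, SUKUN, SUKUN]   # hypothetical LV entry
PATTERN_CLASSES = {YA + "***" + WAW + NUN: [ACTIVE, PASSIVE]}

def compatible(word_lv, pattern_lv):
    """A word LV fits a pattern LV if every diacritic present in the word agrees."""
    return len(word_lv) == len(pattern_lv) and all(
        w == EXTRA_SECOUN or w == p for w, p in zip(word_lv, pattern_lv)
    )

def candidate_patterns(plc, word_lv, classes=PATTERN_CLASSES):
    """Return every LV of the class `plc` that is compatible with `word_lv`."""
    return [lv for lv in classes.get(plc, []) if compatible(word_lv, lv)]

KEY = YA + "***" + WAW + NUN
print(len(candidate_patterns(KEY, [EXTRA_SECOUN] * 6)))          # undiacriticized word
print(len(candidate_patterns(KEY, [DAMMA] + [EXTRA_SECOUN] * 5)))  # first mark given
```

With no diacritics in the input every entry of the class remains a candidate; a single diacritic can already narrow the choice.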
The recursive rule ‘Decompose’ performs the decomposition of the word into
two lists: one for consonants, LC, and one for diacritics, LV. It marks the
absence of a diacritic by adding the EXTRA-SECOUN mark (the dot). Decompose
calls another rule, ‘In’, which returns TRUE only if the character H is one
of the diacritical marks of the Arabic language. The recursive rule Decompose
can be described by the Prolog-style code of Figure 6.

The identification of the root and pattern is carried out by a recursive
procedure ‘Match’ that takes the list LC of the input word and returns the
list PLC of consonants of the pattern and the root ROOT. The Prolog-style
description of the recursive rule Match is presented in Figure 7.

The root of the word is identified by applying the rules that relate the
pattern to the slots of the root letters. The rule ‘FindSlots’ returns a list
of integers giving the positions of the root letters in the given pattern.
Two separate rules are needed to identify the root in order to accommodate
cases where one of the letters of a root is dropped or changed.

The lexical component of the system receives the results of the preceding
analysis: the root, the list of consonants of the pattern PLC and the list of
diacritics of the word LV. The lists LV and PLC determine the pattern of the
analyzed word. The lexical component then verifies the correctness of the
word. An Arabic word is lexically correct if its root is an entry of the
lexicon and its pattern is among the acceptable patterns of this root. If the
word is correct, its meaning or meanings are provided by the lexicon.
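
The lexical check described above can be sketched in Python as a two-level lookup: the root must be an entry of the lexicon, and the pattern (the pair PLC and LV) must be listed under that root. The data layout, the single sample entry (taken from the example of Figure 9) and the function name look_up are assumptions made for illustration; the actual organization of the MAAW lexicon is not specified in this paper.

```python
# Sketch only: the lexicon layout and look_up are hypothetical.
LAM, AIN, BA = "\u0644", "\u0639", "\u0628"
YA, WAW, NUN = "\u064a", "\u0648", "\u0646"
FATHA, DAMMA, SUKUN = "\u064e", "\u064f", "\u0652"

PLC = YA + "***" + WAW + NUN
LV = FATHA + SUKUN + FATHA + DAMMA + SUKUN + SUKUN   # assumed diacritics of the pattern

LEXICON = {
    LAM + AIN + BA: {
        "meaning": "play",
        "patterns": {(PLC, LV): ("verb", "present", "plural", "masculine")},
    },
}

def look_up(root, plc, lv, lexicon=LEXICON):
    """Return (meaning, attributes) if the word is lexically correct, else None."""
    entry = lexicon.get(root)
    if entry is None:                      # the root is not in the lexicon
        return None
    attributes = entry["patterns"].get((plc, lv))
    if attributes is None:                 # pattern not acceptable for this root
        return None
    return entry["meaning"], attributes

print(look_up(LAM + AIN + BA, PLC, LV))
```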
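% decompose(+Word, -LC, -LV): rule 'Decompose' of the text. Splits the
% character list Word into consonants LC and diacritics LV.
% '.' stands for the EXTRA-SECOUN mark.
decompose(Word, LC, LV) :- decompose(Word, LC, LV, false).

% The flag is true while the last consonant read still lacks a diacritic.
decompose([], [], [], false).               % base case: decomposition terminated
decompose([], [], ['.'], true).             % the last consonant is not followed by a diacritic
decompose([H|T], LC, [H|LV], _) :-          % detection of a diacritic
    is_diacritic(H), !,
    decompose(T, LC, LV, false).
decompose([H|T], [H|LC], LV, false) :-      % consonant at the first call or after a diacritic
    decompose(T, LC, LV, true).
decompose([H|T], LC, ['.'|LV], true) :-     % detection of 2 consecutive consonants
    decompose([H|T], LC, LV, false).

% is_diacritic(H): rule 'In' of the text. The fact diacritics/1, holding the
% list of Arabic diacritical marks, is assumed to be defined elsewhere.
is_diacritic(H) :- diacritics(Ds), memberchk(H, Ds).
Figure 6. The Recursive Rule Decompose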
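% match(+LC, +PLC, -Root): rule 'Match' of the text. PLC is a candidate
% pattern consonant list; Root is the root extracted from LC.
match(LC, PLC, Root) :-
    find_pattern(LC, PLC),
    find_slots(PLC, Slots),
    find_root(LC, Slots, Root, 1).

% find_pattern(+LC, +PLC): LC fits PLC when every non-slot letter is identical.
find_pattern([], []).
find_pattern([_|Tail1], ['*'|Tail2]) :- find_pattern(Tail1, Tail2).
find_pattern([Head|Tail1], [Head|Tail2]) :- find_pattern(Tail1, Tail2).

% find_slots(+PLC, -Slots): positions of the '*' slots, counted from 1.
% Not shown in the original figure; this definition is one possible reading.
find_slots(PLC, Slots) :- findall(I, nth1(I, PLC, '*'), Slots).

% find_root(+LC, +Slots, -Root, +Pos): collect the letters of LC found at Slots.
find_root(_, [], [], _).
find_root([Head|Tail1], [Pos|T], [Head|Tail2], Pos) :- !,
    Next is Pos + 1,
    find_root(Tail1, T, Tail2, Next).
find_root([_|Tail1], Slots, Root, Pos) :-
    Next is Pos + 1,
    find_root(Tail1, Slots, Root, Next).
Figure 7. The Recursive Rule Match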
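
For readers less familiar with Prolog, the matching step can also be sketched in Python. The sketch assumes that the candidate pattern consonant lists are supplied as a plain list of strings, which is an assumption of this example and not a statement about how MAAW stores its patterns; slot positions are counted from 1, as in Figure 9.

```python
# Sketch only: a Python rendering of the matching idea, under stated assumptions.
def find_pattern(lc, plc, slot="*"):
    """True if the consonant list `lc` fits the pattern consonant list `plc`."""
    return len(lc) == len(plc) and all(p == slot or p == c for c, p in zip(lc, plc))

def find_slots(plc, slot="*"):
    """1-based positions of the root slots in the pattern."""
    return [i for i, p in enumerate(plc, start=1) if p == slot]

def find_root(lc, slots):
    """Letters of `lc` found at the given 1-based positions."""
    return [lc[i - 1] for i in slots]

def match(lc, patterns):
    """Return (pattern, root) for every stored pattern that `lc` fits."""
    return [(plc, find_root(lc, find_slots(plc))) for plc in patterns if find_pattern(lc, plc)]

YA, LAM, AIN, BA, WAW, NUN, ALEF = (
    "\u064a", "\u0644", "\u0639", "\u0628", "\u0648", "\u0646", "\u0627")
PATTERNS = [YA + "***" + WAW + NUN, "*" + ALEF + "**" + WAW + NUN]
WORD_LC = [YA, LAM, AIN, BA, WAW, NUN]
RESULT = match(WORD_LC, PATTERNS)
print(find_slots(PATTERNS[0]))
print(RESULT)
```

Because a slot accepts any letter, a word may fit more than one stored pattern; the lexicon is then needed to discard the analyses whose root does not exist.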
Input word ( مـعـلـِمـاْت )
→ Decomposition Rules
→ List LC ( م ع ل م ا ت ) and List LV
→ Matching Rules
→ Root ( ع ل م ) and Pattern ( م * * * ا ت )
→ LEXICON
→ RESULTS
Valid word? Yes
Meaning:
Root ( ع ل م ) → Teach
Pattern ( م * * * ا ت, with its diacritical marks ) → (noun, plural, feminine, subject)
Final meaning: Teachers (feminine)
Figure 8. Components of MAAW
6. Experiments
The Morphological Analyzer proposed in this paper can be applied in different
ways. It can be used as an independent system or as a part of an NLP
application. The output of the Analyzer determines whether or not the word is
generated from a root. If it is generated from a root according to a pattern,
this information may be sent to a lexicon to extract the meaning or meanings
of the word. The richness of the lexicon determines the ultimate performance
of the complete system.

Figure 9 gives a complete example of the application of the rules of MAAW.
The proposed Analyzer gives accurate results for fully diacriticized Arabic
text. When diacritical marks are absent, it produces a list of consonants
that corresponds to several candidate morphological patterns. In these cases,
determining the definitive diacritics that replace the missing ones requires
advanced syntactic analysis.

7. Conclusion
The Morphological Analyzer of Arabic
words presented in this paper aims to
prepare Arabic text for natural language
processing applications. It analyzes Arabic
words in order to extract their
morphological structures. This task poses
many problems related to the specific
features of the Arabic language, such as
the presence of diacritics and the
elimination of some letters in the
generation of words.
In order to solve these problems, we
introduced a rule-based system that takes a
word and determines its morphology. The
system has three components: the rules,
the lexicon and the morphological patterns
of the language. The morphological
analyzer proceeds by matching the word against the stored patterns to
determine its root and morphological
pattern. In the case of missing
diacritics, more than one pattern may be a
candidate for the final output.
Future work will focus on extending the Morphological Analyzer to use
syntactic information in order to determine the definitive morphological
pattern used to build the word when diacritical marks are absent.

The input word: يَلْعَبُوْنْ
List of characters: [ ي ـَ ل ـْ ع ـَ ب ـُ و ـْ ن ـْ ]
List of consonants: [ ي ل ع ب و ن ]
List of diacritical marks: [ ـَ ـْ ـَ ـُ ـْ ـْ ]
The pattern that will be detected: [ ي * * * و ن ]
Positions of the root letters: [ 2, 3, 4 ]
The root will be: [ ل ع ب ]
The Lexicon output:
ROOT → Play
PATTERN → Verb, Present, Plural, Masculine
Figure 9. Complete Example of a Word Analyzed by MAAW
References
[1] N. Ali, N. Hegazi and E. Abed, “A Morphology Based Data Compression
Technique For Arabic Text”, Computer Communications–AFRICOM84, pp. 241-251,
1984.
[2] E. L. Antworth, “Morphological Parsing with a Unification-Based Word
Grammar”, North Texas Natural Language Processing Workshop, University of
Texas at Arlington, http://www.sil.org/pckimmo/ntnlp94.html, May 1994.
[3] A. Awajan, “Low-Level NLP Technique for Arabic Text Processing”, The
Proceedings of the ISCA 16th International Conference on Computers and Their
Applications, Seattle, Washington USA, pp. 287-289, March 2001.
[4] K. R. Beesley, “Arabic Finite-State Morphological Analysis And
Generation”, The 16th International Conference On Computational Linguistics,
Proceedings, Vol. 1, pp. 89-94, August 1996.
[5] K. R. Beesley and L. Karttunen, “Finite-State Non-Concatenative
Morphotactics”, SIGPHON-2000, Proceedings of the 5th Workshop of the ACL
Special Interest Group in Computational Phonology, Luxembourg, pp. 1-12,
August 2000.
[6] L. Breidt and F. Segond, “IDAREX: Formal Description of German and French
Multi-Word Expression with Finite State Technology”, MLTT-022, November 1995.
[7] D. Carter, “Rapid development of Morphological Descriptions for Full
Language Processing Systems”,
http://www.cam.sri.com/tr/crc047paper/paper.html, 1997.
[8] J. P. Chanod, “Finite State Composition of French Verb Morphology”, MLTT
Technical reports, http://www.rxrc.xerox.com/publis/mltt/mlttech.html,
November 1994.
[9] G. Grefenstette and P. Tapanainen, “What is a word, What is a sentence,
Problems of Tokenization”, The 3rd Conference on Computational Lexicography
and Text Research, Complex ’94, Budapest, July 1994.