International Journal of Electric & Computer Sciences IJECS-IJENS Vol: 11 No: 01
Arabic Discourse Segmentation Based on
Rhetorical Methods
Iraky Khalifa, Zakareya Al Feki and Abdelfatah Farawila

Abstract:
The discourse segmentation problem in the Arabic language has not been fully addressed. We present a technique for segmenting Arabic discourse into complete sentences. The technique is derived from the Arabic rhetorical system and exploits the most crucial connector, "و", as defined by Arabic linguists almost one thousand years ago. The approach groups the six known rhetorical types of "و" into two classes, segment and unsegment, known as "Fasl" and "Wasl". Segmentation places are decided according to the type of the connector "و". A set of twenty-two syntactic and semantic features, derived from the "Fasl and Wasl" rhetorical methods, is used to categorize each type of "و". The system undergoes learning and testing stages, using the Support Vector Machine (SVM) learning technique to identify the type of each connector "و". An Arabic discourse corpus was developed specifically for this experiment. We achieved a discourse segmentation accuracy of 97.95%.
Index Terms:
Arabic rhetoric methods "Fasl and Wasl", discourse segmentation, machine learning, Rhetorical Structure Theory (RST), Support Vector Machine (SVM).
Manuscript received January 24, 2011.
Iraky Khalifa is with the Computer Science Department, Helwan University, Helwan, Egypt (e-mail: Dr_iraky@hotmail.com).
Zakareya Al Feky is with the Arabic Language Department, University of Alexandria, Alexandria, Egypt (e-mail: ahad66@yahoo.com).
Abdelfatah Farawila (corresponding author) is with the Computer Science Department, Helwan University, Egypt (phone: +966-507810636; e-mail: a_farawila@yahoo.com).

I. INTRODUCTION
A sentence is the part of a speech or a written discourse that has a complete and independent meaning. Sentence segmentation refers to identifying sentences in an unstructured text. It is a basic step for discourse analysis systems, because any text stream must be separated into coherent sentences before it can be analyzed automatically for tasks such as information retrieval, summarization, understanding and translation. It is therefore important to define first what is meant by a complete and independent sentence. Some researchers define a sentence as a finite clause that has a complete and independent meaning [13]. The Cambridge Encyclopedia of Language defines a sentence as the largest unit to which syntactic rules apply [8]. All computational linguistic systems that encode and analyze discourse texts, such as those based on Rhetorical Structure Theory (RST), need to answer the following question: how should a discourse be segmented? This question has been answered, to a certain extent, for languages such as English, French, Chinese, Polish and Spanish [22], but little work has been done for Arabic, owing to the distinct characteristics of the Arabic language. In the present study, we introduce a new method for segmenting an Arabic discourse into its sentence units. Arabic sentence segmentation is hard for two main reasons: the lack of an Arabic corpus dedicated to sentence segmentation, and the very special nature of the Arabic language. We therefore developed an Arabic corpus specifically for training and testing the segmentation experiments in this study. The proposed segmentation method is syntactic and semantic, and it combines two ideas: the Arabic rhetorical methods of discourse segmentation, "Fasl and Wasl", as defined by Arab linguists, and supervised machine learning with a Support Vector Machine (SVM).
The connector "و" (and, Waw) is the most ambiguous connector because its use is mostly rhetorical [4]. In the Arabic rhetoric system, the meaning of "و" plays a great role in understanding consecutive sentences and in turn determines the places of sentence endings [1]. Historically, this problem was addressed long ago by a prominent Arabic linguist, Abdel Quaher Al-Jorjany (عبدالقاهر الجرجاني), who died in 471 Hijri. In his book "Dalael Al Eegaaz" (دلائل الإعجاز), he defines an approach called "Fasl and Wasl", which means "identifying segmentation places in a text" [17]. This approach identifies sentence ending places by understanding the meaning of the connector "و" rather than that of other sentence connectors such as "فـ", "ثم", etc., because their function as sentence separators is evident [1]. In this paper, we use "و" and "Waw" interchangeably to denote the connector, and similarly we use "Fasl" for "Segment" and "Wasl" for "Unsegment". According to the "Fasl and Wasl" rules, "و" has six different meanings. Three of them signal a segmentation place, i.e., "Fasl", whereas the other three are used when the context implies connecting the text before and after the connector, i.e., "Wasl" or Unsegment. Table I describes these six types of "و" and their segmentation effects. The proposed method consists of two phases: 1) training, which characterizes the features of each "و", and 2) testing. A Support Vector Machine (SVM) is used in both phases. The significance of the proposed approach is that it is built on the well-established Arabic rhetorical segmentation rules, "Fasl and Wasl" (الفصل والوصل) [17].
112701-8989 IJECS-IJENS © February 2011 IJENS
This paper is organized into seven sections. Section II describes the rules of sentence segmentation in the Arabic rhetoric system. Section III surveys related work. Section IV gives a brief account of the development of the proposed Arabic corpus. Section V explains the proposed Arabic text segmentation technique. Section VI gives the experimental results along with some discussion. The final section concludes the paper. Because the paper contains Arabic words and terms that may be difficult for non-Arabic speakers, an Appendix at the end of the paper translates them into English.
II. TYPES OF THE CONNECTOR "و" IN "FASL AND WASL" ARABIC RHETORIC
The law of "Fasl and Wasl", as defined by Abdel Quaher Al-Jorjany (عبدالقاهر الجرجاني), is shown below in Fig. 1. Arabic linguists later interpreted it by relating the segmentation places of "Fasl and Wasl" to the meaning of "و". In terms of meaning, there are six types of the connector "و" [2], clustered into two classes, "Fasl" and "Wasl". The class "Fasl" contains three types: 1) Waw1 "والقسم", 2) Waw2 "ورب", and 3) Waw3 "والاستئناف". The class "Wasl" contains the remaining three types: 4) Waw4 "والحال", 5) Waw5 "والمعية", and 6) Waw6 "والعطف" [2]. The names and meanings of these six types, and the class to which each belongs, are shown in Table I.
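The grouping of the six types into two classes can be written down as a small lookup table. The following is only an illustrative sketch: the names `SegClass`, `WAW_TYPES` and `is_boundary` are hypothetical and do not come from the paper, while the meanings and class labels follow Table I.

```python
from enum import Enum


class SegClass(Enum):
    FASL = "segment"      # the connector marks a sentence boundary
    WASL = "unsegment"    # the connector joins related text


# Type number -> (short label, meaning, class), following Table I.
WAW_TYPES = {
    1: ("Waw1, oath", "swear by God or testimony", SegClass.FASL),
    2: ("Waw2, rubba", "few or little", SegClass.FASL),
    3: ("Waw3, resumption", "joins two unrelated sentences", SegClass.FASL),
    4: ("Waw4, state", "adverb of state", SegClass.WASL),
    5: ("Waw5, accompaniment", "object of accompaniment", SegClass.WASL),
    6: ("Waw6, conjunction", "conjunction of two sentences", SegClass.WASL),
}


def is_boundary(waw_type: int) -> bool:
    """Return True when a connector of this type signals a segmentation place."""
    return WAW_TYPES[waw_type][2] is SegClass.FASL
```

With this table, deciding whether an occurrence of the connector ends a sentence reduces to identifying its type, which is the classification problem addressed in the rest of the paper.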
الجملة لها ثلاثة أضرب:
1- جملة حالها مع التى قبلها حال الصفة مع الموصوف و بالتأكيد مع المؤكد فلا يكون فيها العطف البتة لشبه العطف فيها - لو عطفت - بعطف الشئ على نفسه.
2- جملة حالها مع التى قبلها حال الاسم يكون غير الذى قبله إلا أنه يشاركه فى حكم و يدخل معه فى معنى مثل أن يكون كلا الاسمين فاعلا أو مفعولا أو مضافا إليه فيكون حقها العطف.
3- جملة ليست فى شئ من الحالين بل سبيلها مع التى قبلها سبيل الاسم مع الاسم لا يكون منه فى شئ فلا يكون إياه و لا مشاركا له فى معنى بل هو شئ إن ذكر لم يذكر إلا بأمر ينفرد به و يكون ذكر الذى قبله و ترك الذكر سواء فى حاله لعدم التعلق بينه و بينه رأساً و حق هذا ترك العطف البتة.
Sentences fall into three types:
1- A sentence whose relation to its predecessor is like that of an adjective to the noun it describes, or of an emphasis to what it emphasizes. No conjunction is used here, because conjoining the two would resemble conjoining a thing to itself.
2- A sentence whose relation to the preceding sentence is like that of a noun that differs from the preceding noun but shares a grammatical role and a meaning with it, as when both nouns are subjects, objects or genitive complements. Such a sentence calls for a conjunction.
3- A sentence that falls under neither case above: its relation to the preceding sentence is like that of a noun to a completely different noun, neither identical to it nor sharing a meaning with it, so that, if mentioned, it is mentioned on its own account. Mentioning or omitting the preceding sentence makes no difference, as there is no relation whatsoever between them. This implies no conjunction at all. Dropping the conjunction thus signals either complete connection or complete disconnection of meaning, whereas the conjunction is used for the intermediate situation between the two.
Fig. 1. The law of "Fasl and Wasl" in the Arabic rhetoric system

The following six examples show each type of "و", with its meaning, when used to connect two sentences.
A. Waw1 "والقسم":
(1) الأساتذة يعلمون التلاميذ العلم والفضيلة والله إنهم ليقدمون عملاً عظيماً للأمة.
[Professors teach students science and virtue; I swear to God, they perform a great service for their nation.]
In text (1), the "و" together with "الله" gives the meaning of an oath or testimony.
B. Waw2 "ورب":
(2) الشباب ليسوا وحدهم الذين يعانون بل إن أزماتهم جزء من أزمات المجتمع كله ورب سائل يقول: لماذا ركزتم على "الشباب" من بين طبقات المجتمع؟
[Young people are not the only ones who suffer; rather, their crises are part of the crises of the whole society, and someone may ask: why have you focused on youth alone among the groups of society?]
In text (2), the "و" together with "رب" gives the meaning "few" or "someone".
C. Waw3 "والاستئناف":
(3) يعاني المراهقون من بعض المشكلات النفسية و المجتمع عامة به سلبيات أخرى كثيرة.
[Adolescents suffer from some psychological problems and there are, in general, numerous other problems in society.]
In text (3), the "و" carries no specific meaning; it merely joins two unrelated sentences. In the three examples above, "و" marks a segmentation place according to the Arabic rhetoric methods. These three types make up the class "Fasl".
On the other hand, the following three examples show the other three types of "و", which make up the "Wasl" class. They have an unsegmenting effect because the meanings before and after "و" are related.
D. Waw4 "والحال":
(4) دخل المدرس الفصل وهو مبتسم.
[The teacher came into the classroom smiling.]
In (4), the "و" indicates that the clause it introduces, "smiling", acts as an adverb of state for the preceding sentence, "The teacher came into the classroom".
E. Waw5 "والمعية":
(5) جلس الحبيبان وضوء القمر.
[The couple sat together with the light of the moon.]
In text (5), the "و" indicates that what follows it acts as an object of accompaniment for what precedes it.
F. Waw6 "والعطف":
(6) بدأت الدراسة وانتظم المعلمون و الطلاب في المدارس.
[The study started and teachers and students enrolled in schools.]
In (6), the "و" is a conjunction of related words or sentences. We notice that this text contains "و" twice, but we are concerned with the one that connects sentences, not words.

TABLE I
TYPES OF THE CONNECTOR "و"
No. | Type of "و" | Meaning of "و" | Class (Fasl / Wasl)
1 | القسم | Swear by God or testimony | Fasl
2 | رب | Few or little | Fasl
3 | الاستئناف | Signals attaching a sentence to the preceding one when the two sentences are not related in meaning | Fasl
4 | الحال | Adverb of state | Wasl
5 | المعية | Object of accompaniment | Wasl
6 | العطف | Conjunction of two sentences | Wasl
III. RELATED WORK
Contemporary research on sentence segmentation is driven by approaches that depend on the purpose of the segmentation. These approaches include topic identification, reference tables, statistics, syntax and semantics. The work closest to ours was reported in 2008 by Ameur A. Touir et al. [5]. They developed an empirical technique for Arabic sentence segmentation based on the connecting words between sentences, as these are commonly used by Arabic writers. Their approach can be regarded as semantic and cue-phrase based. They introduced the notions of active and passive connectors, on which their technique depends. Active connectors are words that indicate the beginning or the end of a sentence, or a complete sentence. Passive connectors do not by themselves indicate a new segment, the end of a segment or a complete segment; rather, they occur with active connectors and contribute to determining where segments start or end. The technique is limited by the fact that a connector that is active in one text may be passive in another, and by the impracticality of collecting all possible active connectors. Furthermore, their technique is not based on the Arabic rhetoric methods [10], [15], and [16]. Some text segmentation methods are topic based, where each part of the text addresses a certain topic. Such an approach segments a text into paragraphs rather than sentences. Work along these lines includes Lamprier et al. [11], using genetic algorithms, and Magimai-Doss et al. [14], using an entropy-based technique. Another approach is based on a reference table [3]: potential segments that fit the reference table attributes are identified and then added to the table. Statistical approaches are used extensively in [6], [10], and [20]. In the work of Le Thanh et al. [12], the text is segmented into elementary discourse units based on syntactic information and cue phrases. Cristea et al. [7] used segmentation based on discourse structure for text summarization. Palmer and Hearst [19] described a system that uses the syntactic context of a potential sentence boundary to classify it. Other approaches use regular expressions augmented with linguistic knowledge about abbreviations to detect boundaries [21].
IV. BUILDING AN ARABIC CORPUS FOR DISCOURSE SENTENCE SEGMENTATION
The need to develop an Arabic sentence segmentation method led us to recognize the importance of a corpus for training the system and testing its performance. Discourse corpora have been created for several languages, such as the Penn Discourse Treebank (PDTB) for English, which is annotated for discourse connectives, the relations they convey, and their arguments. The PDTB approach has also been shown to extend to other languages such as Hindi, Turkish and Chinese. Recently, a similar effort for Modern Standard Arabic produced the Leeds Arabic Discourse Treebank (LADTB), but it has not yet been released to researchers [4]. For this reason, we had to develop our own Arabic discourse corpus. This new corpus is restricted to the study of the connector "و". To build it, discourses were collected from Arabic newspapers and books, and some necessary preprocessing was performed. The corpus is structured as a table with two parts: a header part and an annotation part. The header part contains the position and the type of each occurrence of "و", whereas the annotation part contains 22 columns of features. The preprocessing and the extraction of features for each type of "و" are explained in more detail in Section V, subsections A and B.
V. THE ARABIC SENTENCE SEGMENTATION METHOD
The proposed Arabic sentence segmentation method is semantic and depends on the role of the connector "و" in the Arabic language. The technique decides on segmentation places in a text according to the meaning of each "و". There are six types of "و", classified into two classes, "Fasl" and "Wasl", each containing three types, as shown in Table I, according to Arabic rhetorical linguists [2]. Every occurrence of a "Fasl" type is treated as a sentence boundary, while the types that constitute the class "Wasl" have no segmentation effect. During the learning stage, syntactic and semantic features are extracted manually for each occurrence of "و" and provided to a Support Vector Machine (SVM), a supervised machine learning model. In the testing phase, the learned SVM model recognizes the type of each "و", which is in turn used to decide whether it is a sentence boundary. Although the connector "و" is not the only indicator of sentence boundaries, our method ignores other indicators, such as punctuation marks, cue phrases and other connectives, because "و" is the most common and most ambiguous connector. The system consists of three steps: 1) preprocessing, 2) feature extraction, and 3) classification, as illustrated in Fig. 2.
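Once the type of each connector is known, the segmentation step itself is simple. The sketch below assumes that the text is already tokenized with each connector as a separate token and that the predicted types are given as a mapping from token index to type number; the function name `segment` and the choice to start the new sentence at the connector are assumptions made for illustration, not details taken from the paper.

```python
FASL_TYPES = {1, 2, 3}  # Waw1, Waw2 and Waw3 signal a sentence boundary


def segment(tokens, waw_types):
    """Split a token list into sentences at every Fasl-type connector.

    tokens    -- list of strings, each connector being its own token
    waw_types -- dict mapping the index of a connector token to its type (1..6)
    """
    sentences, current = [], []
    for i, token in enumerate(tokens):
        # A Fasl connector closes the running sentence and opens a new one.
        if waw_types.get(i) in FASL_TYPES and current:
            sentences.append(current)
            current = []
        current.append(token)
    if current:
        sentences.append(current)
    return sentences
```

For example, `segment(["A", "B", "و", "C", "و", "D"], {2: 3, 4: 6})` splits only at the first connector, because the second one is of the Wasl type Waw6.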
A. Preprocessing
Step 1: Diacritization:
In Arabic, the grammatical role of a word is indicated by diacritical marks added at the end of the word. Writers often omit these marks and leave the reader to infer the proper diacritization while reading. Because the marks are needed to interpret the text correctly, we added them manually to both the training and the testing texts during the preparation of the corpus.
Step 2: Discriminating the connector "و" from the letter "و":
In Arabic typing, the connector "و" is attached to the word that follows it, without a separating space. In the example "وقال الرجل" [and the man said], the connector "و" is typed directly before the verb "قال" without a space. However, some words begin with the letter "و", for example the word "وجد" [found]. The "و" in "وقال الرجل" acts as a connector, whereas the "و" in "وجد" is part of the word. The second step removes this confusion between the connector "و" and the letter "و".
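The paper does not say how this disambiguation is carried out. One simple possibility, shown here purely as an assumption, is a lookup against a list of words whose initial "و" belongs to the stem; the lexicon `STEM_INITIAL_WAW` and the function `split_connector` are hypothetical, and a real system would need a full morphological lexicon or analyzer.

```python
WAW = "\u0648"  # the Arabic letter waw

# Hypothetical mini-lexicon: words in which the initial waw is part of the stem.
STEM_INITIAL_WAW = {"وجد", "وصل", "وقت"}


def split_connector(token):
    """Return (connector, word): connector is WAW if the token starts with
    the connector, or None if the initial waw belongs to the word itself."""
    if token.startswith(WAW) and len(token) > 1 and token not in STEM_INITIAL_WAW:
        return WAW, token[1:]
    return None, token
```

Under these assumptions, "وقال" is split into the connector and the verb "قال", while "وجد" is left intact.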
B. Feature Extraction:
During the feature extraction stage, the syntactic and semantic features of each type of the connector "و" are extracted manually. This analysis is built on the Arabic rhetorical methods [2]. Twenty-two features were found to be required to distinguish the types of "و". The feature sets, named X1, X2, …, X22, and the elements of each set are listed in Table II. In the following paragraphs, we discuss the features for every possible occurrence of each type of the connector "و".
B.1. Feature Extraction of the first type: Waw1 "والقسم"
This type of "و" comes before a word such as "الله" and means "I swear by"; the next word, "الله", is the object of the oath or testimony. Normally, the object of the oath is the word "الله" or an equivalent word. This type is recognized by the word that follows it, in one of two cases:
Case 1:
- The following word is the noun "الله", and
- The end-case diacritical mark of the following word is genitive.
Features are: X1 = "الله" and X7 = genitive mark
Case 2:
- The following word is a noun,
- The end-case diacritical mark of the following word is genitive, and
- The following word is not a pronoun.
Features are: X3 = noun, X7 = genitive mark, and X16 = no
B.2. Feature Extraction of the second type: Waw2 "ورب"
The "و" combined with the word "رب" means "few of" or "little". There are two cases: in the first the word "رب" appears explicitly, and in the second it is omitted and understood implicitly.
Case 1:
- The next word is the noun "رب", and
- The end-case diacritical mark of the next word is accusative.
Features are: X1 = "رب" and X7 = accusative mark
Case 2:
- The following word is an indefinite (unknown) noun,
- The end-case diacritical mark of the following word is genitive, and
- The end-case diacritical mark of the previous word is not genitive.
Features are: X3 = noun, X5 = indefinite, X6 ≠ genitive mark, and X7 = genitive mark
B.3. Feature Extraction of the third type: Waw3 "والاستئناف"
This type carries no meaning of its own; it merely joins two adjacent, unrelated sentences. It is recognized from the features of the two sentences before and after it. There are four cases:
Case 1:
- The two sentences before and after Waw3 differ in mode: one is a statement sentence and the other is a subject sentence, i.e., an imperative, interrogative or vocative sentence. The subject sentence is called "Inshaeya" (جملة إنشائية) in Arabic. The feature is: X12 ≠ X13
Case 2:
- The sentence types before and after Waw3 differ, i.e., one sentence is nominal and the other is verbal. In this case it is preferable to segment the two sentences. The feature is: X14 ≠ X15
Case 3:
- The two sentences differ in tense. Unless their tenses are similar, segmentation is normally expected. The feature is: X19 ≠ X20
Case 4:
- The two sentences before and after Waw3 have different verbs and different subjects. The features are: X21 = no and X22 = no
B.4. Feature Extraction of the fourth type: Waw4 "والحال"
This type of "و" comes before an "adverb of state" sentence. It is recognized from the word that follows it, in one of two cases:
Case 1:
- The word after Waw4 is anaphoric to a noun in the previous sentence. The feature is: X16 = yes
Case 2:
- The word after Waw4 is "قد", followed by a verb in the past tense.
Features are: X1 = "قد", X10 = verb, and X11 = past tense
B.5. Feature Extraction of the fifth type: Waw5 "والمعية"
This type is similar to the "object of accompaniment" in English. It is recognized by the following word only: in Arabic grammar, the word that comes after "والمعية" should be an accusative noun. The features are: X3 = noun and X7 = accusative mark
B.6. Feature Extraction of the sixth type: Waw6 "والعطف"
The function of "والعطف", or Waw6, is to join two related nouns, verbs, nominal sentences or verbal sentences. It occurs in two cases:
Case 1:
- Conjunction of words (nouns or verbs).
Features are: X2 = X3, X6 = X7, and (X4 = X5 or X8 = X9 or X17 = X18)
Case 2:
- Conjunction of sentences (nominal or verbal).
Features are: X12 = X13, X14 = X15, X19 = X20, and (X21 = yes or X22 = yes)
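The case rules of subsections B.1 to B.6 can be restated as boolean conditions over a feature dictionary. The sketch below is a plain transcription of those rules, not the paper's classifier (which is an SVM trained on the same features). The English value labels and the transliterations "Allah", "rubba" and "qad" for the three X1 elements are assumptions chosen for readability, and the helper names are hypothetical.

```python
def _same(a, b):
    """True only when both values are present and equal."""
    return a is not None and a == b


def _differ(a, b):
    """True only when both values are present and different."""
    return a is not None and b is not None and a != b


def matching_types(f):
    """Return the Waw types (1..6) for which at least one case rule matches.

    f maps feature names "X1".."X22" to categorical values; missing
    features are treated as unknown and never satisfy a rule.
    """
    g = f.get
    rules = {
        1: [g("X1") == "Allah" and g("X7") == "genitive",
            g("X3") == "noun" and g("X7") == "genitive" and g("X16") == "no"],
        2: [g("X1") == "rubba" and g("X7") == "accusative",
            g("X3") == "noun" and g("X5") == "indefinite"
            and _differ(g("X6"), "genitive") and g("X7") == "genitive"],
        3: [_differ(g("X12"), g("X13")),
            _differ(g("X14"), g("X15")),
            _differ(g("X19"), g("X20")),
            g("X21") == "no" and g("X22") == "no"],
        4: [g("X16") == "yes",
            g("X1") == "qad" and g("X10") == "verb" and g("X11") == "past"],
        5: [g("X3") == "noun" and g("X7") == "accusative"],
        6: [_same(g("X2"), g("X3")) and _same(g("X6"), g("X7"))
            and (_same(g("X4"), g("X5")) or _same(g("X8"), g("X9"))
                 or _same(g("X17"), g("X18"))),
            _same(g("X12"), g("X13")) and _same(g("X14"), g("X15"))
            and _same(g("X19"), g("X20"))
            and (g("X21") == "yes" or g("X22") == "yes")],
    }
    return [t for t, cases in rules.items() if any(cases)]
```

The function returns a list rather than a single type because the rules can overlap: for instance, a genitive indefinite noun after the connector can satisfy Case 2 of both Waw1 and Waw2. Such overlaps are one reason a learned classifier is a reasonable choice over hand-written rules.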
TABLE II
Feature sets of the different types of the connector "و"
Feature set | Xi | Meaning | Elements
+Word | X1 | Next word | الله, رب, قد
-Word_POS | X2 | Previous word part of speech | اسم, فعل, حرف
+Word_POS | X3 | Next word part of speech | اسم, فعل, حرف
-Word_D/I | X4 | Previous word definite/indefinite | معرفة, نكرة
+Word_D/I | X5 | Next word definite/indefinite | معرفة, نكرة
-Word_Diacritic | X6 | Previous word end case diacritical mark | فتحة, ضمة, كسرة, سكون
+Word_Diacritic | X7 | Next word end case diacritical mark | فتحة, ضمة, كسرة, سكون
-Word_S/P | X8 | Previous word singular/plural | مفرد, مثنى, جمع
+Word_S/P | X9 | Next word singular/plural | مفرد, مثنى, جمع
++Word_POS | X10 | Next next word part of speech | اسم, فعل, حرف
++Word_Tense | X11 | Next next word tense | مضارع, ماض, أمر
-Sentence_Mode | X12 | Previous sentence mode | خبر, إنشاء
+Sentence_Mode | X13 | Next sentence mode | خبر, إنشاء
-Sentence_Type | X14 | Previous sentence type | اسمية, فعلية
+Sentence_Type | X15 | Next sentence type | اسمية, فعلية
+Word_Is_Anaphored_Pronoun | X16 | Whether the next word is a pronoun (refers to a word in the previous sentence) or not | نعم, لا
-Word_M/F | X17 | Previous word male/female | مذكر, مؤنث
+Word_M/F | X18 | Next word male/female | مذكر, مؤنث
-Sentence_Tense | X19 | Previous sentence tense | مضارع, ماض, أمر
+Sentence_Tense | X20 | Next sentence tense | مضارع, ماض, أمر
Sentence_Event_B&A | X21 | Same sentence event before and after | نعم, لا
Sentence_Subject_B&A | X22 | Subjects are the same in sentences before and after | نعم, لا

C. Support Vector Machine (SVM): Training and Classification
The experiments with our technique were implemented with the multiclass Support Vector Machine (SVM), version 2.20, developed by Thorsten Joachims [18]. We used 22 feature sets as the input of the SVM classifier and 6 classes as its output. Features are denoted X1, X2, …, X22. Classes are Waw1, Waw2, …, Waw6.
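SVM-light style tools read training examples as sparse text lines of the form `<target> <index>:<value> ...`, with feature indices in increasing order. The paper does not describe how its categorical features were turned into numbers, so the sketch below assumes a simple integer coding; the `CODES` table (shown for X3 and X7 only) and the function `svmlight_line` are hypothetical.

```python
# Assumed integer codes for two categorical features (illustration only).
CODES = {
    3: {"noun": 1, "verb": 2, "particle": 3},                           # X3
    7: {"accusative": 1, "nominative": 2, "genitive": 3, "jussive": 4},  # X7
}


def svmlight_line(label, features):
    """Format one example as '<label> <index>:<value> ...'.

    label    -- Waw type, 1..6
    features -- dict mapping a feature number to its categorical value
    """
    parts = [str(label)]
    for idx in sorted(features):  # indices must appear in increasing order
        parts.append(f"{idx}:{CODES[idx][features[idx]]}")
    return " ".join(parts)
```

A Waw5 example with an accusative noun after the connector would be written as `5 3:1 7:1`. A one-hot encoding of each categorical value is a common alternative, since integer codes impose an artificial ordering on the categories.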
[Fig. 2 diagram. Training phase: Arabic corpus → feature extraction → SVM learning. Testing phase: unsegmented text T → SVM classifier → segmented sentences of T.]
VI. EXPERIMENT AND RESULTS
The Corpus of Arabic Discourse Sentence Segmentation designed in this work is used in this experiment. We used 1200 instances for training and 293 instances for testing. Class Waw5 "والمعية" appeared in neither the training nor the testing set because it is seldom used in Modern Standard Arabic. Classes 1, 2 and 4 appeared in only a few instances. It could be said that the experiment was effectively carried out with only two classes, 3 and 6, which represent Waw3 "والاستئناف" and Waw6 "والعطف" respectively. Table III summarizes the results of our experiment in terms of precision and recall.
TABLE III
Overall precision/recall measure of classifying the connector "و"
Waw type | No. of occurrences in testing | No. of occurrences in prediction | Precision % | Recall %
القسم | 10 | 10 | 100% | 100%
رب | 4 | 3 | 100% | 75%
الاستئناف | 94 | 94 | 98.94% | 98.94%
الحال | 6 | 7 | 85.71% | 100%
المعية | 0 | 0 | - | -
العطف | 179 | 179 | 99.44% | 99.44%
Total / Avg. | 293 | 293 | Avg. = 96.82% | Avg. = 94.68%
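The precision and recall figures of Table III can be reproduced from three counts per type. In the sketch below, the numbers of correctly predicted instances are not printed in the table; they are inferred from the three misclassifications reported in the discussion of the results (one Waw3 predicted as Waw6, one Waw6 as Waw3, one Waw2 as Waw4), so they should be read as an assumption. Waw5 is left out because it has no instances.

```python
# type -> (actual in test set, predicted, correctly predicted)
COUNTS = {
    "Waw1": (10, 10, 10),
    "Waw2": (4, 3, 3),
    "Waw3": (94, 94, 93),
    "Waw4": (6, 7, 6),
    "Waw6": (179, 179, 178),
}


def precision_recall(actual, predicted, correct):
    """Precision = correct / predicted, recall = correct / actual (in %)."""
    return 100 * correct / predicted, 100 * correct / actual


scores = {t: precision_recall(*c) for t, c in COUNTS.items()}
avg_precision = sum(p for p, _ in scores.values()) / len(scores)
avg_recall = sum(r for _, r in scores.values()) / len(scores)
```

Rounded to two decimals, the averages over the five populated classes agree with the table: 96.82% precision and 94.68% recall.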
As mentioned before, three types of "و", Waw1 "والقسم", Waw2 "ورب" and Waw3 "والاستئناف", act as segmentation indicators, whereas the other three types, Waw4 "والحال", Waw5 "والمعية" and Waw6 "والعطف", do not. We can therefore combine them into only two classes: Fasl and Wasl.
The results in Table III show that, among the 293 instances of the connector "و", 290 are classified correctly and 3 incorrectly. One instance of Waw3 is predicted as Waw6, one instance of Waw6 is predicted as Waw3, and one instance of Waw2 is predicted as Waw4. Accordingly, the segmentation accuracy can be computed as:
$$\text{True\_Fasl} = \sum_{i=1}^{3} \text{True instances of } \text{Waw}_i \qquad (1)$$
$$\text{True\_Wasl} = \sum_{j=4}^{6} \text{True instances of } \text{Waw}_j \qquad (2)$$
$$\text{Segmentation Accuracy} = \frac{\text{True\_Fasl} + \text{True\_Wasl}}{\text{Total number of instances}} = \frac{(10+3+93)+(6+0+178)}{293} = 98.98\% \qquad (3)$$
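Equations (1) to (3) amount to a short calculation, reproduced here as a check on the arithmetic. The variable names are illustrative; the counts are the correctly classified instances per type used in equation (3).

```python
# Correctly classified instances per Waw type, as used in equation (3).
correct = {1: 10, 2: 3, 3: 93, 4: 6, 5: 0, 6: 178}
total_instances = 293

true_fasl = sum(correct[i] for i in (1, 2, 3))   # equation (1)
true_wasl = sum(correct[j] for j in (4, 5, 6))   # equation (2)
accuracy = 100 * (true_fasl + true_wasl) / total_instances  # equation (3)
```

This gives 106 correct Fasl instances and 184 correct Wasl instances, i.e. 290 of 293, or 98.98% after rounding.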
Although we addressed only the toughest part of the problem, the ambiguous connector "و", our results are still better than those of the method based on identifying active and passive connectors [5], which is so far the only comparable work on Arabic text segmentation. Moreover, had we also treated other indicators, such as punctuation marks and cue phrases, as segmentation indicators, as was done in [5], we would expect to reach higher accuracy. Accuracy could also improve if the number of instances in the learning phase were enlarged.
In comparison with the Active/Passive method, our method is able to segment the following sentence into two segments at the position of "و", while the Active/Passive method cannot detect a sentence boundary at this position because the word "تقوم" is not in its connective list.
يذهب الأبناء إلى المدرسة في الصباح وتقوم الأم بإعداد طعام الغداء.
[Children go to school in the morning and the mother prepares the lunch].
It is impractical to enumerate all passive connectors in the Arabic language. Therefore, our proposed method surpasses the Active/Passive method by considering only the connector "و", along with the 22 features listed in Table II.
Fig. 2. The Arabic sentence segmentation system
REFERENCES
[1] Abdulaziz Ateque, "Elm Al-Maany". Dar Al-Nahda Al-Arabeia for Publishing, Egypt, 2009. Published in Arabic.
[2] P. Abduquader Hussien, "Athar Al-Nohat fi Al-Bahth Al-Balaghy". Dar Nahdat Misr for Printing and Publishing, Egypt, 1984. Published in Arabic.
[3] Agichtein, E. and V. Ganti, "Mining reference tables for automatic text segmentation". Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD'04), Seattle, Washington, USA, ACM Press, 2004, pp. 20-29.
[4] Amal Al-Saif, Katja Markert, "The Leeds Arabic Discourse Treebank: Annotating Discourse Connectives for Arabic". LREC 2010 Proceedings, 2010.
[5] Ameur A. Touir, Hassan Mathkour and Waleed Al-Sanea, "Semantic-Based Segmentation of Arabic Texts". Information Technology Journal, Asian Network for Scientific Information, 2008, (7): pp. 1009-1015.
[6] Beeferman, D., A. Berger and J. D. Lafferty, "Statistical models for text segmentation", Mach. Learning, 1999, 34: pp. 177-210.
[7] Cristea, D., O. Postolache and L. Pistol, "Summarization Through Discourse Structure". Comput. Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, Germany, 2005, Vol. 3406, pp. 632-644.
[8] David Crystal, "The Cambridge Encyclopedia of Language". Cambridge University Press, New York, 1987.
[9] Fredrik Jørgensen, "Clause Boundary Detection in Transcribed Spoken Language", in Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit (Eds.), NODALIDA 2007 Conference Proceedings, 2007, pp. 235-239.
[10] Golcher, F., "Statistical text segmentation with partial structure analysis". Proceedings of KONVENS 2006, Konstanz, Germany, 2006, pp. 44-51.
[11] Lamprier, S., T. Amghar, B. Levrat and F. Saubion, "SegGen: A genetic algorithm for linear text segmentation". Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 2007, pp. 1647-1653.
[12] Le Thanh, H., G. Abeysinghe and C. Huyck, "Automated discourse segmentation by syntactic information and cue phrases". Proceedings of AIA 2004, Innsbruck, Austria, 2004, pp. 411-415.
[13] M. Taboada, L. H. Zabala, "Deciding on Units of Analysis within Centering Theory", Corpus Linguistics and Linguistic Theory, 2008, 4, pp. 3-108.
[14] M. Magimai-Doss, D. Hakkani-Tür, Ö. Çetin, E. Shriberg, J. Fung, N. Mirghafori, "Entropy Based Classifier Combination for Sentence Segmentation", IEEE, 2007.
[15] Marcu, D., "The rhetorical parsing of unrestricted texts: A surface-based approach". Comput. Linguistics, 2000, 26: pp. 395-448.
[16] Marcu, D., "The Theory and Practice of Discourse Parsing and Summarization". 1st Edn. The MIT Press, 2000, UK.
[17] Mostafa Hemeida, "Nedhum Al-Ertebat wa Al-Rabt fi Tarkeeb Al-Gomla Al-Arabeia". The Egyptian Int. Company for Pub. (Longman), Egypt, 1997. Published in Arabic.
[18] Multiclass Support Vector Machine. Available: http://svmlight.joachims.org/svm_multiclass.html
[19] Palmer, D., and Hearst, M., "Adaptive Multilingual Sentence Boundary Disambiguation", Computational Linguistics, 1997, 23 (2), pp. 241-267.
[20] Utiyama, M. and H. Isahara, "A statistical model for domain-independent text segmentation". Proceedings of the 39th Annual Meeting of the Association for Comp. Linguistics and 10th Conference of the European Chapter of the ACL 2001, Toulouse, France, 2001, pp. 91-498.
[21] Walker, D.J., Clements D.E., Darwin M. and Amtrup W., "Sentence Boundary Detection: A Comparison of Paradigms for Improving MT Quality", In Proceedings of the 8th Machine Translation Summit, Santiago de Compostela, Spain, 2001, pp. 369-372.
[22] Yang, C.C. and K.W. Li, "A heuristic method based on a statistical approach for Chinese text segmentation", J. Am. Soc. Inform. Sci. Technol., 2005, 56: pp. 1438-1447.
Appendix A: Arabic to English translation of Arabic terms used in this paper.
Arabic term | English translation
اسم | noun
الاستئناف | resume
الحال | adverb of state
الخ. | etc.
العطف | conjunction
القسم | swear
الله | God
المعية | object of accompaniment
أمر | imperative
ثم | next
جمع | plural
جملة اسمية | nominal sentence
جملة إنشائية | subject sentence
جملة خبرية | statement sentence
جملة فعلية | verbal sentence
حرف | preposition
رب | few
زمن | tense
سكون | jussive mark
ضمة | nominative mark
ضمير غائب | pronoun
فـ | next
فتحة | accusative mark
فصل (Fasl) | segment
فعل | verb
قد | perhaps
كسرة | genitive mark
لا | no
ماض | past tense
مثنى | dual
مذكر | male
مضارع | present tense
معرفة | known
مفرد | singular
مؤنث | female
نعم | yes
نكرة | unknown
و (Waw) | and
وصل (Wasl) | unsegment