Arabic Discourse Segmentation Based on Rhetorical

International Journal of Electric & Computer Sciences IJECS-IJENS Vol: 11 No: 01

Arabic Discourse Segmentation Based on Rhetorical Methods

Iraky Khalifa, Zakareya Al Feki and Abdelfatah Farawila

Abstract: The discourse segmentation problem in the Arabic language has not been fully addressed. This paper presents a technique for segmenting Arabic discourse into complete sentences. The technique is derived from the Arabic rhetorical system and exploits the crucial connector "و", as defined by Arabic linguists almost one thousand years ago. The approach groups the six known rhetorical types of "و" into two classes, segment and unsegment, known as "Fasl" and "Wasl". Segmentation places are decided according to the type of the connector "و". A set of twenty-two syntactic and semantic features, derived from the "Fasl and Wasl" rhetorical methods, is used to categorize each type of "و". The system goes through learning and testing stages, using the Support Vector Machine (SVM) machine learning technique to identify the type of each connector "و". An Arabic discourse corpus was developed specifically for this experiment. We achieved a discourse segmentation accuracy of 97.95%.

Index Terms: Arabic rhetoric methods "Fasl and Wasl", discourse segmentation, machine learning, Rhetorical Structure Theory (RST), Support Vector Machine (SVM).

I. INTRODUCTION

A sentence is the part of a speech or a written discourse that has a complete and independent meaning. Sentence segmentation refers to identifying sentences in an unstructured text. It is a basic step for discourse analysis systems, because any text stream needs to be separated into coherent sentences to enable effective automatic analysis such as information retrieval, summarization, understanding and translation. It is therefore important to first define what is meant by a complete and independent sentence.
Some researchers have defined a sentence as a finite clause that has a complete and independent meaning [13]. The Cambridge Encyclopedia of Language defines a sentence as the largest unit to which syntactic rules apply [8]. All computational linguistic systems that encode and analyze discourse texts, such as Rhetorical Structure Theory (RST), need to answer the following question: how should a discourse be segmented? This question has been answered, to a certain extent, for languages such as English, French, Chinese, Polish and Spanish [22], but little work has been done for Arabic. This is due to the distinct and unique characteristics of the Arabic language. In the present study, we introduce a new method of segmenting an Arabic discourse into its sentence units. Arabic sentence segmentation is considered hard because of two main difficulties: the lack of an Arabic corpus dedicated to sentence segmentation, and the very special nature of the Arabic language. An Arabic corpus was therefore developed specifically for training and testing the segmentation experiments in this study. The proposed segmentation method is syntactic/semantic based and combines two ideas: the Arabic rhetorical methods of discourse segmentation, "Fasl and Wasl", as defined by Arab linguists, and supervised machine learning with a Support Vector Machine (SVM). The connector "و" (and/Waw) is recognized as the most ambiguous connector because its use is mostly rhetorical [4].

Manuscript received January 24, 2011.
Iraky Khalifa is with the Computer Science Department, Helwan University, Helwan, Egypt (e-mail: Dr_iraky@hotmail.com).
Zakareya Al Feky is with the Arabic Language Department, University of Alexandria, Alexandria, Egypt (e-mail: ahad66@yahoo.com).
Abdelfatah Farawila is with the Computer Science Department, Helwan University, Egypt (phone: +966-507810636; e-mail: a_farawila@yahoo.com) (corresponding author).
In the Arabic rhetoric system, the meaning of "و" plays a great role in understanding consecutive sentences and, in turn, determines where sentences end [1]. Historically, this problem was addressed long ago by a prominent Arabic linguist, Abdel Quaher Al-Jorjany (عبدالقاهر الجرجاني), who died in 471 Hijri. In his book "Dalael Al Eegaaz" (دلائل الإعجاز), he defines an approach called "Fasl and Wasl", which means "identifying segmentation places in a text" [17]. This approach identifies sentence-ending places by understanding the meaning of the connector "و" rather than that of other sentence connectors such as "فـ , ثم ... الخ", because their function as sentence separators is evident [1]. In this paper, we use "و" and "Waw" interchangeably to denote the connector "و"; similarly, we use "Fasl" and "Segment" interchangeably, and likewise "Wasl" and "Unsegment". According to the "Fasl and Wasl" rules, there are six different meanings of "و". Three of them signal a segmentation place, i.e., "Fasl", whereas the other three are used when the context implies connecting the text before and after the connector, i.e., "Wasl" or Unsegment. Table I describes these six types of "و" and their segmentation effects. The proposed method consists of two phases: 1) training, which characterizes the features of each "و", and 2) testing. A Support Vector Machine (SVM) is used during both the training and the testing phases. The significance of the proposed approach is that it is built on the well-established Arabic rhetoric segmentation rules, "Fasl and Wasl" (الفصل والوصل) [17].

112701-8989 IJECS-IJENS © February 2011 IJENS

This paper is organized into seven sections. Section II describes the rules of sentence segmentation in the Arabic rhetoric system. Section III surveys some related work. Section IV presents a brief account of the development of the proposed Arabic corpus.
Section V explains the proposed Arabic text segmentation technique. Section VI gives experimental results along with some discussion. The final section concludes the paper. Because the paper contains Arabic words and terms that may be difficult for non-Arabic speakers, an Appendix at the end of the paper translates the Arabic terms used here into English.

II. TYPES OF THE CONNECTOR "و" IN "FASL AND WASL" ARABIC RHETORICS

The law of "Fasl and Wasl", as defined by Abdel Quaher Al-Jorjany (عبدالقاهر الجرجاني), is shown in Fig. 1. Arabic linguists later interpreted it by relating the segmentation places of "Fasl and Wasl" to the meaning of "و". There are six types of the connector "و" in terms of meaning [2], clustered into two classes: "Fasl" and "Wasl". The class "Fasl" contains three types: 1) Waw1 "و القسم", 2) Waw2 "و رب", and 3) Waw3 "و الاستئناف". The class "Wasl" contains the remaining three types: 4) Waw4 "و الحال", 5) Waw5 "و المعية", and 6) Waw6 "و العطف" [2]. The names and meanings of these six types, and the class to which each belongs, are shown in Table I.

الجملة لها ثلاثة أضرب:
1- جملة حالها مع التى قبلها حال الصفة مع الموصوف و بالتأكيد مع المؤكد فلا يكون فيها العطف البتة لشبه العطف فيها - لو عطفت - بعطف الشئ على نفسه.
2- جملة حالها مع التى قبلها حال الاسم يكون غير الذى قبله إلا أنه يشاركه فى حكم و يدخل معه فى معنى مثل أن يكون كلا الاسمين فاعلا أو مفعولا أو مضافا إليه فيكون حقها العطف.
3- جملة ليست فى شئ من الحالين بل سبيلها مع التى قبلها سبيل الاسم مع الاسم لا يكون منه فى شئ فلا يكون إياه و لا مشاركا له فى معنى بل هو شئ إن ذكر لم يذكر إلا بأمر ينفرد به و يكون ذكر الذى قبله و ترك الذكر سواء فى حاله لعدم التعلق بينه و بينه رأساً و حق هذا ترك العطف البتة.

Sentences fall into three types:
1- A sentence that relates to its predecessor as an adjective relates to the noun it describes, or as an emphasis relates to what it emphasizes. A conjunction is never used here, because conjoining the two sentences would resemble conjoining a thing to itself.
2- A sentence that relates to its predecessor as a noun relates to a different preceding noun with which it shares a position and a meaning, as when both nouns are subjects, objects or genitive complements. Such a sentence calls for a conjunction.
3- A sentence that falls under neither case above: it relates to its predecessor as a noun relates to a completely different noun, neither being the same as it nor sharing a meaning with it; if it is mentioned, it is mentioned for its own sake. In this case, mentioning or not mentioning the previous sentence makes no difference, as there is no relation whatsoever between the two. This implies no conjunction at all.
Dropping the conjunction is thus either for complete connection or for complete disconnection, whereas the conjunction is used for what lies between the two cases, a situation between two situations.

Fig. 1. The law of "Fasl and Wasl" in the Arabic rhetoric system

The following six examples show each type of "و", with its meaning, when used to connect two sentences.

A. Waw1 "و القسم":
(1) الأساتذة يعلمون التلاميذ العلم والفضيلة والله إنهم ليقدمون عملاً عظيماً للأمة.
[Professors teach students science and virtue; I swear to God, they perform a great service for their nation.]
In text (1), the "و" together with "الله" gives the meaning of testimony.

B. Waw2 "و رب":
(2) الشباب ليسوا وحدهم الذين يعانون بل إن أزماتهم جزء من أزمات المجتمع كله ورب سائل يقول: لماذا ركزتم على "الشباب" من بين طبقات المجتمع؟
[Young people are not the only ones who suffer; rather, their crises are part of the crises of the whole society, and someone may ask: why have you focused on "youth" alone among the classes of society?]
In text (2), the "و" together with "رب" gives the meaning "few" or "someone".

C. Waw3 "و الاستئناف":
(3) يعاني المراهقون من بعض المشكلات النفسية و المجتمع عامة به سلبيات أخرى كثيرة.
[Adolescents suffer from some psychological problems, and there are, in general, numerous other problems in society.]
In text (3), the "و" does not carry any specific meaning; it merely joins two unrelated sentences.

In the three examples above, the "و" marks a segmentation place according to the Arabic rhetoric methods. These three types belong to the class "Fasl". On the other hand, the following three examples show the other three types of "و", which belong to the class "Wasl". They have an unsegmenting effect because the meanings before and after the "و" are related.

D. Waw4 "و الحال":
(4) دخل المدرس الفصل وهو مبتسم.
[The teacher came into the classroom smiling.]
In text (4), the "و" indicates that the clause it introduces ("smiling") acts as an adverb of state for the preceding sentence ("The teacher came into the classroom").

E. Waw5 "و المعية":
(5) جلس الحبيبان وضوء القمر.
[The couple sat together with the light of the moon.]
In text (5), the "و" indicates that what follows it acts as an object of accompaniment for the preceding sentence.

F. Waw6 "و العطف":
(6) بدأت الدراسة وانتظم المعلمون و الطلاب في المدارس.
[Studies started, and teachers and students enrolled in schools.]
In text (6), the "و" is a conjunction of related words or sentences. We notice that this text contains "و" twice, but we are concerned with the one that connects sentences, not the one that connects words.

TABLE I
TYPES OF THE CONNECTOR "و"
No. | Type of "و" | Meaning of "و" | Class (Fasl / Wasl)
1 | القسم | Swear by God or testimony | Fasl
2 | رب | Few or little | Fasl
3 | الاستئناف | Attaches a sentence to its preceding one when the two sentences are not related in meaning | Fasl
4 | الحال | Adverb of state | Wasl
5 | المعية | Object of accompaniment | Wasl
6 | العطف | Conjunction of two sentences | Wasl

III. RELATED WORK

Contemporary research on sentence segmentation is driven by approaches that depend on the purpose of the segmentation. These approaches include topic identification, reference tables, statistics, syntax, and semantics. The work most closely related to the present one was reported in 2008 by Ameur A. Touir et al. [5]. They developed an empirical technique for Arabic sentence segmentation based on the connecting words between sentences, as these are commonly used by Arabic writers. Their approach can be considered semantic and cue-phrase based. In [5], they introduced the notion of active and passive connectors, on which their technique depends. Active connectors are words that indicate the beginning or the end of a sentence, or a complete sentence. Passive connectors do not by themselves indicate a new segment, the end of a segment, or a complete segment; rather, they occur with active connectors and contribute to determining where segments start or end. The limitation of this technique is that some active connectors may act as passive connectors in other texts, and that it is impractical to collect all possible active connectors. Furthermore, their technique is not based on the Arabic rhetoric methods [10], [15], and [16]. Some text segmentation methods are topic based, where each part of the text addresses a certain topic. In fact, this approach segments a text into paragraphs rather than sentences. Work along these lines was carried out by Lamprier et al. [11], using genetic algorithms, and by M. Magimai.-Doss et
al. [14], using an entropy measure technique. Another approach is based on a reference table [3]: potential segments that fit the reference table attributes are identified and then added to the table. Statistical approaches are used extensively in [6], [10], and [20]. In the work of Le Thanh et al. [12], the text is segmented into elementary discourse units based on syntactic information and cue phrases. Cristea et al. [7] used segmentation based on discourse structure for the purpose of text summarization. Palmer and Hearst [19], on the other hand, described a system that uses the syntactic context of a potential sentence boundary to classify the boundary. Other approaches used regular expressions, augmented with linguistic knowledge about abbreviations, to detect boundaries [21].

IV. BUILDING AN ARABIC CORPUS FOR DISCOURSE SENTENCE SEGMENTATION

The need to develop an Arabic sentence segmentation method led us to recognize the importance of having a corpus with which to train the system and test its performance. There have been efforts to create discourse corpora in different languages, such as the Penn Discourse Treebank (PDTB) for English, which is annotated for discourse connectives, the relations they convey, and their arguments. It has also been shown to be extensible to other languages such as Hindi, Turkish and Chinese. Recently, a similar effort was made for Modern Standard Arabic by producing the Leeds Arabic Discourse Treebank (LADTB), but, unfortunately, it has not yet been released to researchers [4]. For this reason, it was necessary to develop an Arabic discourse corpus. This new corpus is restricted to studying the connector "و". To build it, discourses were collected from Arabic newspapers and books, and some necessary preprocessing was performed. The corpus structure is table-like, with two parts: a header part and an annotation part.
The header part contains the position and the type of each occurrence of "و", whereas the annotation part contains 22 columns of features. The preprocessing and the extraction of the features for each type of "و" are explained in more detail in Section V, subsections A and B.

V. THE ARABIC SENTENCE SEGMENTATION METHOD

The proposed Arabic sentence segmentation method is semantic based and depends on the role of the connector "و" in the Arabic language. According to the meaning of the "و", the technique decides on segmentation places in a text. There are six types of "و", classified into two classes, "Fasl" and "Wasl". Each class contains three types of "و", as shown in Table I, according to Arabic rhetorical linguists [2]. Thus, every occurrence of a "و" of class "Fasl" is treated as a sentence boundary, while the types of "و" that constitute class "Wasl" have no segmentation effect. During the learning stage, syntactic and semantic features for each occurrence of "و" are extracted manually. In the testing phase, we use the supervised machine learning model. For that, we provided the Support Vector Machine (SVM) with the features of each "و". The learned SVM model is then used to recognize the type of each "و", which in turn determines whether it marks a sentence boundary. Although the connector "و" is not the only indicator of sentence boundaries, our method ignores other indicators, such as punctuation marks, cue phrases and other connectives, because "و" is the most common and most ambiguous connector. The system consists of three steps: 1) Preprocessing, 2) Feature extraction, and 3) Classification, as illustrated in Fig. 2.

A. Preprocessing

Step 1: Diacritization: In Arabic, the part of speech is determined by diacritical marks added at the end of each word. Writers often omit these marks and let the reader infer the proper diacritization while reading. Diacritical marks are essential for understanding Arabic.
Hence, we added diacritical marks manually to both the training and the testing texts during the preparation of the corpus.

Step 2: Discriminate the connector "و" from the letter "و": In Arabic typing, the connector "و" is typed attached to the word that follows it, with no separating space. For example, in "وقال الرجل" [and the man said], the connector "و" is typed directly before the verb "قال" without a space. In turn, some words start with the letter "و", for example the word "وجد" [found]. The "و" in "وقال الرجل" acts as a connector, whereas the "و" in "وجد" is part of the word "وجد". During the second step, the confusion between the connector "و" and the letter "و" is removed.

B. Feature Extraction

During the feature extraction stage, the syntactic and semantic features of each type of the connector "و" are extracted manually. This analysis is built on Arabic rhetorical methods [2]. Twenty-two features were found to be required to distinguish the types of "و". The feature sets, named X1, X2, …, X22, and the elements of each set are listed in Table II. In the following paragraphs, we discuss the features for every possible occurrence of each type of the connector "و".

B.1. Feature extraction of the first type: Waw1 "و القسم"

This type of "و" comes before a word such as "الله" and means "I swear by"; the next word, "الله", is the object of the oath or testimony. Normally, the object of the oath is the word "الله" or an equivalent word. Waw1 is recognized by its successive word in the following two cases.

Case 1:
- The successive word is the noun "الله", and
- The end-case diacritical mark of the successive word is genitive.
Features: X1 = "الله" and X7 = genitive mark

Case 2:
- The successive word is a noun,
- The end-case diacritical mark of the successive word is genitive, and
- The successive word must not be a pronoun.
Features: X3 = noun, X7 = genitive mark, and X16 = no

B.2. Feature extraction of the second type: Waw2 "و رب"

The structure of "و" combined with the word "رب" means "few of" or "little". There are two cases: the first occurs when the word "رب" appears explicitly, and the second when the word "رب" is hidden and understood implicitly.

Case 1:
- The next word is the noun "رب", and
- The end-case diacritical mark of the next word is accusative.
Features: X1 = "رب" and X7 = accusative mark

Case 2:
- The successive word is an unknown (indefinite) noun,
- The end-case diacritical mark of the successive word is genitive, and
- The end-case diacritical mark of the previous word is not genitive.
Features: X3 = noun, X5 = indefinite, X6 ≠ genitive mark, and X7 = genitive mark

B.3. Feature extraction of the third type: Waw3 "و الاستئناف"

This type carries no meaning of its own; it merely joins two adjacent, unrelated sentences. It can be recognized from the features of the two sentences before and after it. There are four structures of this type.

Case 1:
- The two sentences before and after Waw3 are of different kinds. In other words, if one of them is a statement sentence, the other is a subject sentence, i.e., an imperative, interrogative or vocative sentence. The subject sentence in Arabic is called "Inshaeya" (جملة إنشائية).
Feature: X12 ≠ X13

Case 2:
- The sentence types before and after Waw3 are different, i.e., one sentence is nominal and the other is verbal. In this case, it is preferable to segment the two sentences.
Feature: X14 ≠ X15

Case 3:
- The two sentences differ in tense. Unless the two sentences are similar in their tenses, segmentation of the two sentences is normally expected.
Feature: X19 ≠ X20

Case 4:
- The two sentences before and after Waw3 have different verbs and different subjects.
Features: X21 = no and X22 = no

B.4. Feature extraction of the fourth type: Waw4 "و الحال"

This type of "و" comes before an "adverb of state" sentence. It can be recognized from its successive word. In Arabic grammar, the word that comes after Waw4 has the features of one of the following two cases.

Case 1:
- The word after Waw4 is anaphoric to a noun in the previous sentence.
Feature: X16 = yes

Case 2:
- The word after Waw4 is "قد", followed by a verb in the past tense.
Features: X1 = "قد", X10 = verb, and X11 = past tense

B.5. Feature extraction of the fifth type: Waw5 "و المعية"

This type corresponds to the "object of accompaniment". It can be recognized by its successive word only. In Arabic grammar, the word that comes after "و المعية" should be an accusative noun.
Features: X3 = noun and X7 = accusative mark

B.6. Feature extraction of the sixth type: Waw6 "و العطف"

The function of "و العطف", or Waw6, is to join two related nouns, two verbs, two nominal sentences or two verbal sentences. It occurs in two cases.

Case 1:
- Conjunction of words, nouns or verbs.
Features: X2 = X3, X6 = X7, and (X4 = X5 or X8 = X9 or X17 = X18)

Case 2:
- Conjunction of sentences, nominal or verbal.
Features: X12 = X13, X14 = X15, X19 = X20, and (X21 = yes or X22 = yes)

TABLE II
FEATURE SETS OF THE DIFFERENT TYPES OF THE CONNECTOR "و"
Feature set | Xi | Meaning | Elements
+Word | X1 | Next word | الله , رب , قد
-Word_POS | X2 | Previous word part of speech | اسم , فعل , حرف
+Word_POS | X3 | Next word part of speech | اسم , فعل , حرف
-Word_D/I | X4 | Previous word definite/indefinite | معرفة , نكرة
+Word_D/I | X5 | Next word definite/indefinite | معرفة , نكرة
-Word_Diacratic | X6 | Previous word end-case diacritical mark | فتحة , ضمة , كسرة , سكون
+Word_Diacratic | X7 | Next word end-case diacritical mark | فتحة , ضمة , كسرة , سكون
-Word_S/P | X8 | Previous word singular/plural | مفرد , مثنى , جمع
+Word_S/P | X9 | Next word singular/plural | مفرد , مثنى , جمع
++Word_POS | X10 | Next-next word part of speech | اسم , فعل , حرف
++Word_Tense | X11 | Next-next word tense | ماض , مضارع , أمر
-Sentence_Mode | X12 | Previous sentence mode | خبر , إنشاء
+Sentence_Mode | X13 | Next sentence mode | خبر , إنشاء
-Sentence_Type | X14 | Previous sentence type | اسمية , فعلية
+Sentence_Type | X15 | Next sentence type | اسمية , فعلية
+Word_Is_Anaphored_Pronoun | X16 | Whether or not the next word is a pronoun (refers to a word in the previous sentence) | نعم , لا
-Word_M/F | X17 | Previous word male/female | مذكر , مؤنث
+Word_M/F | X18 | Next word male/female | مذكر , مؤنث
-Sentence_Tense | X19 | Previous sentence tense | ماض , مضارع , أمر
+Sentence_Tense | X20 | Next sentence tense | ماض , مضارع , أمر
Sentence_Event_B&A | X21 | Same sentence event before and after | نعم , لا
Sentence_Subject_B&A | X22 | Subjects are the same in the sentences before and after | نعم , لا

C. Support Vector Machine (SVM): Training and Classification

The experiments with our technique were implemented with the multiclass Support Vector Machine (SVM) version 2.20 developed by Thorsten Joachims [18]. We used 22 feature sets as the input of the SVM classifier and 6 classes as its output. The features are denoted X1, X2, …, X22. The classes are Waw1, Waw2, …, Waw6.

[Fig. 2 diagram labels: Arabic Corpus; Feature Extraction; Training phase; SVM Learning; Testing phase; Unsegmented text: T; SVM Classifier; Segmented sentences of T]

VI. EXPERIMENT AND RESULTS

The Corpus of Arabic Discourse Sentence Segmentation designed within this work is used in this experiment. We used 1200 instances for training and 293 instances for testing. Class Waw5 "و المعية" appeared in neither training nor testing because it is seldom used in Modern Standard Arabic. Classes 1, 2 and 4 appeared in only a few instances. It could be said that the experiment was effectively carried out with only two classes, 3 and 6, which represent Waw3 "و الاستئناف" and Waw6 "و العطف" respectively. Table III summarizes the result of our experiment with the precision/recall measure.

TABLE III
OVERALL PRECISION/RECALL MEASURE OF CLASSIFYING THE CONNECTOR "و"
Waw type | No. of occurrences in testing | No. of occurrences in prediction | Precision % | Recall %
القسم | 10 | 10 | 100% | 100%
رب | 4 | 3 | 100% | 75%
الاستئناف | 94 | 94 | 98.94% | 98.94%
الحال | 6 | 7 | 85.71% | 100%
المعية | 0 | 0 | |
العطف | 179 | 179 | 99.44% | 99.44%
 | Total = 293 | Total = 293 | Avg. = 96.82% | Avg. = 94.68%

As mentioned before, the three types Waw1 "و القسم", Waw2 "و رب" and Waw3 "و الاستئناف" act as segmentation indicators, whereas the other three types, Waw4 "و الحال", Waw5 "و المعية" and Waw6 "و العطف", do not. Therefore, we can combine them into two classes only: Fasl and Wasl. The results in Table III show that, among the 293 instances of the connector "و", 290 are classified correctly and 3 incorrectly. One instance of Waw3 is predicted as Waw6, one instance of Waw6 is predicted as Waw3, and one instance of Waw2 is predicted as Waw4.
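The per-class figures in Table III can be re-derived from the reported counts and the three misclassifications described above. The following Python sketch is illustrative only and is not the authors' evaluation code: the confusion counts are reconstructed from Table III and the error description, and the names `CONFUSION`, `per_class_scores`, `macro_average` and `two_class_accuracy` are hypothetical. It assumes that the averages in Table III are taken over the five classes that actually occur (Waw5 is absent).

```python
from collections import Counter

CLASSES = ["Waw1", "Waw2", "Waw3", "Waw4", "Waw5", "Waw6"]
FASL = {"Waw1", "Waw2", "Waw3"}  # the other three types form the Wasl class

# (true type, predicted type) -> count; reconstructed from Table III and the
# three reported errors (Waw3->Waw6, Waw6->Waw3, Waw2->Waw4). Assumption.
CONFUSION = {
    ("Waw1", "Waw1"): 10,
    ("Waw2", "Waw2"): 3,
    ("Waw2", "Waw4"): 1,
    ("Waw3", "Waw3"): 93,
    ("Waw3", "Waw6"): 1,
    ("Waw4", "Waw4"): 6,
    ("Waw6", "Waw6"): 178,
    ("Waw6", "Waw3"): 1,
}


def per_class_scores(confusion):
    """Return {type: (precision %, recall %)} for every type that occurs."""
    actual, predicted = Counter(), Counter()
    for (true_type, predicted_type), count in confusion.items():
        actual[true_type] += count
        predicted[predicted_type] += count
    scores = {}
    for waw in CLASSES:
        if actual[waw] == 0 and predicted[waw] == 0:
            continue  # type never seen (Waw5 in this experiment)
        hits = confusion.get((waw, waw), 0)
        precision = 100.0 * hits / predicted[waw] if predicted[waw] else 0.0
        recall = 100.0 * hits / actual[waw] if actual[waw] else 0.0
        scores[waw] = (precision, recall)
    return scores


def macro_average(scores):
    """Unweighted mean of precision and of recall over the scored types."""
    n = len(scores)
    avg_precision = sum(p for p, _ in scores.values()) / n
    avg_recall = sum(r for _, r in scores.values()) / n
    return avg_precision, avg_recall


def two_class_accuracy(confusion):
    """Share of instances whose predicted type is on the correct side of Fasl/Wasl."""
    correct = sum(
        count
        for (true_type, predicted_type), count in confusion.items()
        if (true_type in FASL) == (predicted_type in FASL)
    )
    return 100.0 * correct / sum(confusion.values())


if __name__ == "__main__":
    scores = per_class_scores(CONFUSION)
    for waw, (precision, recall) in scores.items():
        print(f"{waw}: precision={precision:.2f}% recall={recall:.2f}%")
    print("averages: %.2f%% / %.2f%%" % macro_average(scores))
    print("Fasl/Wasl accuracy: %.2f%%" % two_class_accuracy(CONFUSION))
```

With these reconstructed counts the sketch gives average precision and recall of 96.82% and 94.68%, and a two-class accuracy of 98.98% (290 of 293). All three errors happen to cross the Fasl/Wasl boundary, so counting agreement at the class level gives the same 290 correct instances as counting exact type matches.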
Accordingly, the segmentation accuracy can be computed as:

\[ \text{True\_Fasl} = \sum_{i=1}^{3} \text{True instances of } \text{Waw}_i \qquad (1) \]

\[ \text{True\_Wasl} = \sum_{j=4}^{6} \text{True instances of } \text{Waw}_j \qquad (2) \]

\[ \text{Segmentation Accuracy} = \frac{\text{True\_Fasl} + \text{True\_Wasl}}{\text{Total number of instances}} = \frac{(10+3+93) + (6+0+178)}{293} = 98.98\% \qquad (3) \]

Although we addressed only the toughest part of the problem, the ambiguous connector "و", our results are still better than those of the method that depends on identifying active and passive connectors [5], which is the only comparable work on Arabic text segmentation so far. Moreover, had we addressed other segmentation indicators such as punctuation marks and cue phrases, as was done in [5], we would have reached higher accuracy. We could also obtain higher accuracy by enlarging the number of instances in the learning phase. Compared with the Active/Passive method, our method is able to segment the following sentence into two segments at the position of "و", while the Active/Passive method cannot detect a sentence boundary at this position because the word "تقوم" is not in the connective list.

يذهب الأبناء إلى المدرسة في الصباح وتقوم الأم بإعداد طعام الغداء.
[Children go to school in the morning and the mother prepares the lunch.]

It is impractical to enumerate all passive connectors in the Arabic language. Therefore, our proposed method surpasses the active/passive method by considering only the connector "و", along with the proposed 22 features listed in Table II.

Fig. 2. The Arabic sentence segmentation system

REFERENCES

[1] Abdulaziz Ateque, "Elm Al-Maany". Dar Al-Nahda Al-Arabeia for Publishing, Egypt, 2009. Published in Arabic.
[2] P. Abduquader Hussien, "Athar Al-Nohat fi Al-Bahth Al-Balaghy". Dar Nahdat Misr for Printing and Publishing, Egypt, 1984. Published in Arabic.
[3] Agichtein, E. and V.
Ganti, "Mining reference tables for automatic text segmentation". Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD'04), Seattle, Washington, USA, ACM Press, 2004, pp. 20-29.
[4] Amal Al-Saif, Katja Markert, "The Leeds Arabic Discourse Treebank: Annotating Discourse Connectives for Arabic". LREC 2010 Proceedings, 2010.
[5] Ameur A. Touir, Hassan Mathkour and Waleed Al-Sanea, "Semantic-Based Segmentation of Arabic Texts". Information Technology Journal, Asian Network for Scientific Information, 2008, (7): pp. 1009-1015.
[6] Beeferman, D., A. Berger and J. D. Lafferty, "Statistical models for text segmentation". Mach. Learning, 1999, 34: pp. 177-210.
[7] Cristea, D., O. Postolache and L. Pistol, "Summarization Through Discourse Structure". Comput. Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, Germany, 2005, Vol. 3406, pp. 632-644.
[8] David Crystal, "The Cambridge Encyclopedia of Language". Cambridge University Press, New York, 1987.
[9] Fredrik Jørgensen, "Clause Boundary Detection in Transcribed Spoken Language". Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit (Eds.), NODALIDA 2007 Conference Proceedings, 2007, pp. 235-239.
[10] Golcher, F., "Statistical text segmentation with partial structure analysis". Proceedings of KONVENS 2006, Konstanz, Germany, 2006, pp. 44-51.
[11] Lamprier, S., T. Amghar, B. Levrat and F. Saubion, "SegGen: A genetic algorithm for linear text segmentation". Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 2007, pp. 1647-1653.
[12] Le Thanh, H., G. Abeysinghe and C. Huyck, "Automated discourse segmentation by syntactic information and cue phrases". Proceedings of AIA 2004, Innsbruck, Austria, 2004, pp. 411-415.
[13] M. Taboada, L. H. Zabala, "Deciding on Units of Analysis within Centering Theory". Corpus Linguistics and Linguistic Theory, 2008, 4, pp. 3-108.
[14] M.
Magimai.-Doss, D. Hakkani-Tür, Ö. Çetin, E. Shriberg, J. Fung, N. Mirghafori, "Entropy Based Classifier Combination for Sentence Segmentation". IEEE, 2007.
[15] Marcu, D., "The rhetorical parsing of unrestricted texts: A surface-based approach". Comput. Linguistics, 2000, 26: pp. 395-448.
[16] Marcu, D., "The Theory and Practice of Discourse Parsing and Summarization". 1st Edn. The MIT Press, 2000, UK.
[17] Mostafa Hemeida, "Nedhum Al-Ertebat wa Al-Rabt fi Tarkeeb Al-Gomla Al-Arabeia". The Egyptian Int. Company for Pub. (Longman), Egypt, 1997. Published in Arabic.
[18] Multiclass Support Vector Machine. Available: http://svmlight.joachims.org/svm_multiclass.html
[19] Palmer, D., and Hearst, M., "Adaptive Multilingual Sentence Boundary Disambiguation". Computational Linguistics, 1997, 23 (2), pp. 241-267.
[20] Utiyama, M. and H. Isahara, "A statistical model for domain-independent text segmentation". Proceedings of the 39th Annual Meeting of the Association for Comp. Linguistics and 10th Conference of the European Chapter of the ACL 2001, Toulouse, France, 2001, pp. 91-498.
[21] Walker, D.J., Clements D.E., Darwin M. and Amtrup W., "Sentence Boundary Detection: A Comparison of Paradigms for Improving MT Quality". In Proceedings of the 8th Machine Translation Summit, Santiago de Compostela, Spain, 2001, pp. 369-372.
[22] Yang, C.C. and K.W. Li, "A heuristic method based on a statistical approach for Chinese text segmentation". J. Am. Soc. Inform. Sci. Technol., 2005, 56: pp. 1438-1447.

Appendix A: Arabic to English translation of Arabic terms used in this paper.

Arabic term | English translation
اسم | noun
الاستئناف | resume
الحال | adverb of state
الخ | etc.
العطف | conjunction
القسم | swear
الله | God
المعية | object of accompaniment
أمر | imperative
ثم | next
جمع | plural
جملة اسمية | nominal sentence
جملة إنشائية | subject sentence
جملة خبرية | statement sentence
جملة فعلية | verbal sentence
حرف | preposition
رب | few
زمن | tense
سكون | jussive mark
ضمة | nominative mark
ضمير غائب | pronoun
فـ | next
فتحة | accusative mark
فصل (Fasl) | segment
فعل | verb
قد | perhaps
كسرة | genitive mark
لا | no
ماض | past tense
مثنى | double
مذكر | male
مضارع | present tense
معرفة | known
مفرد | singular
مؤنث | female
نعم | yes
نكرة | unknown
و (Waw) | and
وصل (Wasl) | unsegment
