From: Proceedings of the Eleventh International FLAIRS Conference. Copyright © 1998, AAAI (www.aaai.org). All rights reserved. AnEllipsis Detection Methodbased on a Clause Parser for Arabic Language Kais Haddar, Abdelmajid Ben Hamadou Facult6 des Sciences Economiqueset de Gestion de Sfax Laboratoire de recherche LARIS,B.P. 1(188 Sfax - TUNISIA Abdelmajid.Benhamadou@fsegs.rnu.tn Copyright ©1998, American Association forArtifidal Intelligence (www.aaai.org). Alldghtsreserved. Abstract Fllipsis has alwaysbeena centre of academic interest, be it in linguistic theory or in computational linguistics. In the present paper, we proposeaa ellipsis detection methodtbr the Arabic language using the ATNformalism. "llais lbnnalism, which is easy to implement, ullox~ us to construct an etl~cient clause parser to detect ellipses in complexsentence structures. Then, we give a lbnnal characterisationof ellipsis and a detection methodbasedon an ATNgrammarIbr well-lbrmed clauses. Finally, we present somecriteria to classify’ complexelliptical sentences in order to elaborate specific resolution algoritluns. Introduction Ellipsis has ahvays been a center ofaeademicinterest, be it in linguistic theory or in computational linguistics. A simple exampleof Arabic ellipsis is given in sentence (1): t 31 (I) 7he young boy went to school and his brother lwentl to universi~.’. in this example, the second clanse contains the deletion of a verbal phrase, the meaning of which is to be determined from the first clause. According to Dalrymple terminology (DalD’mple, Shieber and Pereira 1991), call the clause that the verb is copied from the well-formed clause, and the clause which contains the elided verb the elliptical clause. Syntactic accounts of ellipsis posit the copying of syntaclic structure front the well-formed clause representation to the elliptical clause one (Gardent 1992), (Hardt 1993)and (Lappin 1992). In this paper, we take an interest in the automatic treatment of the phenomenonof ellipsis in disconrse written in Arabic language. To identify the well-formed clause in an elliptical sentenceand to localize the elliptical ones, we propose to construct an Arabic grammar that recognises only the structure of well-formedclauses. Also. the grammar can be tightly restricted to define the sentence structure in terms of a small nnmberofwellformedclause structures. The originality of our approachresides in using a clause grammarto detect ellipses of complexsentence stn~ctures 270 Haddar and in establishing a classification ofelliptieal sentences which facilitates the construction of efficient resolution algorithms. The paper is structured as follows: in section 2, we presenl some related works and compare them to our approach. Section 3 is devoted to the presentation of the clause grammar used for the well-formed clause identification. In section 4, we give an ellipsis detection methodbased on a forntal characterisation of elliptical sentences. In section 5, we propose an elliptical sentence classification based on the elliptical sentence s.vntactic structure. The last section presents our conclusions and futures works. Related work Several methodshave been used to treat ellipses in their different forms and in different contexls. Wecile in particular: ¯ The extension of the grammarby adding explicit rules or mctarules that prevent ellipses (Gardent 1992), (Huang1984) attd (Sabah i 989). ¯ The relaxing of constraints: Consistency and ordering constraints can be relaxed on a second pass, allowing lhe detection and resolution of ellipses (Weischedeland Sondheimer 1982). ¯ The fitted parsing: It happens when the rules of a conventional syntactic gTammarare unable to produce a parse for all elliptical semence.This technique can be used to produce a reasonable approximateparse that can serve as input to remaining stages of natural language processing (Jensen, Heidorn and Richardson 1993). Traditional linguistic theo~’ distinguishes ellipsis forms (i.e., Right-node Raising. Gapping, VP-deletion, CoordinateReduction) according to ,~.’nU~ctic structure of elliptical clauses (Gardent 1992). Syntactic accounts these ellipsis fornts posit the cop.vingof syntactic structure fromthe well-formedclause representation to the elliptical clause one. The used approaches are generally based on a complexsentence grammarthat can not easily Ioeale the elliptical parts of the sentence (Boguraev 1983)and (Huang 1984). In this approaches, elliptical clause occurrence does not play any role. In our approach, we distinguishellipsis classes accordingto syntacticstructure of elliptical sentencesby taking into accountthe following considerations: - occurrenceof elliptical clauses that a sentence contains, - alternationof well-formed clause/elliptical clause, - ellipsis form. Unlike to the above mentioned approaches, our approachis based on another .type of grammarnamelyan ATNclause granunar used to identify the well-formed clause and to locate the elliptical ones. Writinggrammar rules is the first step in designing ATNparser which allows us to have moreinformation about structure and meaningof clauses. Clause parser Syntactic analysis can be performed using complex sentence grammar. But, the specification of such a grammar is complexand requires mucheffort. Besides, it will be ditficult to localise the elliptical parts of the sentence. Ourpropositionis to use a clause grammar.So, weminimizethe effort required in writing the sentence grammar,wecan easily identify the well-formedclause of an elliptical sentenceandthen locate the elliptical ones, and wecan detect the elliptical fragmentsof an elliptical clause. In addition, the less structures are allowed, the higher is the potential ambiguity. Note that an Arabic sentencemaybegin either with a verb phrase(VP)or with a nounphrase(NP). Werestrict our studyto the sentences that begin with VP.Asin (Lappin1995), vt~e consider exceptionphrasefragmentnot as art instance of ellipsis, but as a displacedNPmodifier. In the following,wegive a subset of Arabicgrammar rules: S--> VP + NP S ~ VP+ (NPIPP)+ exception prep + S --, interrogative pro + VP+ NP S -~ S + (NPIPP) NP --> NP + conj+NPINPIlNP-.INPjINP41nounl propernounI pro PP--, prep + (NPt[ NP21NP31NP41noun) NP1---> art + nounI demonstrativepro + noun NP2---> NP1+ AP NP3--* noun+ (NPdN’P2) NP4---> noun+ adj I NP4+ adj VP--, negative adv + verb [verb AP--,art + adj I art + adj + AP. The clause grammarthat weuse for our approachis an augmentedcontext-free clanse grammar.To implementit, weconstruct first a RecursiveTransition Network(RTN) (Woods 1970). Themainnetwork usedinthis section shown inFigure I.Then, wetransform it intoanATNin order to addsubcategorization principle andagreement constraints whichare useful for the resolution processand to avoidambiguity.In order to store these information,the ATN uses registers whichare responsible for holding a structural elementor phrasesuchas noun,verb, adj, N-P, VP, etc. Asentence like Johnand Maryare alike is acceptedby our parser as a well-formedclause but not as a eonjtmtion as in other parser (Woods1970). The mechanismalso overcomes the inheritance of nonsensicalproblems. S: NP,PP Figure I: The mainnetworkof a grammar rules fragment. Themajorcomponents of our clause parser are: - a set of lexical categories for the wordsof the Arabic languageand a lexicon in whicheach wordis assigned the categories that apply to it (Belguith and Ben l-lamadou1996); -an Arabic clause grammar,using the samecategories as the lexicon and whateverconstructs are required by the type of grammarused to specify the well-formed clausestructures: - a parsing program,whoseinputs are the clause to be processed, the lexicon, and the Arabicclause grammar, and whose output for each clause is a structural descriptionof the clause. Ellipsis detection method Webegin by recalling that it is possible, from the linguistic viewpoint,to adopt for the Arabiclanguagethe same typology of ellipses (i.e., Gapping, Right-node Raising, Coordination Reduction,...) astheoneproposed forEnglish andFrench (Gardent 1992), (Hardt 1993), (Lappin 1992)and(Rau1985). In thefollowing, present themaincharacteristics ofellipses intheArabic language. Set formsof ellipses Certain expressions, andnotably letter endings, formsof wishes, certain orders or certain exclamations are elliptical. To understandthem, it isn’t necessary, to construct the completeform. That’s why,they are called falseellipses. Examples: NaturalLanguage Processing 271 Happv new year I .~k.~- I..~ [~ ~.~J (2) j~l ,j~l [j.~-I] (3) b~re, fire These set forms of ellipses do not interest us because thoy can be resolved at the iexical level. But we rather focus our interest on the ellipses that oblige the reader to search in the context the lacking elements without which the messagewill be incomprehensible. The sentence (1) an exampleof this. The nominal ellipsis The nominalellipsis is the omissionof the essential part of a noun phrase (i.e., head) that has to be taken from outside of the part of the previous discourse. Thenin a sentence that contains a nominalellipsis, one of the noun phrases is incomplete. Example: ¯ :’L~"~~-J~r~’] : ,:.’~..,~.,~’ ,.:..-- J ..,~ ~.~(2) 77:e oldest brother is twen~"and the youngest [brotherl is six years old. In example(4), the nominalphrase containing ellipsis is represented by an article and an adjective (the youngest). This treated example contains a nominal ellipsis. The whole phrase ellipsis The whole phrase ellipsis is distinguished from the nominal one by the fact that the omitted constitucnt may Ix: a whole phrase (noun phrase or verb phrase). Therefore, in this category., we can distinguish the followingellipsis forms: ¯ Gapping: The young boy went to school and his brother [went] to universi.tv. ¯ Right-node Raising: .;,~,.)~lit, ll - .,.:...D=~l .~a.-..Ij~t;~~(5) I sell,~’ a bike and[1] gives afishing rod to mysister. ¯ Coordination Reduction: ¯ ~.jL~ .&:;0"1"=--~. ] : ,e.,,-b-,<g.io-"-,.:-4i,.(6) l ask nly brotherfor his bike anti my.~’isterfi)r her fishing rod. Formalism According Io what we have exposed above and the fact that we use an ATNclause grammar, we consider that an elliptical sentence in twodistinct parts: an elliptical part preceded by a well-formed part that can be used as reference to resolve the ellipsis. The elliptical part nfight be composedof a succession of elliptical clauses generally separated with connection words (that we call connectors) which are coordinators, correlative conjunctions, subordinators..... 272 Haddar To formally characterise this phenomena,we musl first introduce somenotations and definitions. Elliptical clause. Let V be the finite Arabic alphabel. Let V" be the sol offinite words over V. V) = V’\{~:}. where is the emptyword. De.fnition l. Wecall clause any finite sequence P of words ofV", P=w)w:...Wm, w~V"and l<i_<m. Wenote +. the set of all clauses that can be construcled fromV Let G be the Arabic clause grammarover V describing thus a subset L(G)c_ ;z,. To verify whether a given clause P is generated by G, we introduce the following function that represents the acceptance function of the clause parser: analyse : P ---> [g {true.f, ilse} true if I’ E L(G P ~ false otherwise. Definition 2. A clause P iscalled well-.[brmedifP is in L(G). <:> ] Note S the set of finite sequencesof clauses of P. Definition 3. Let S=P.P)...P, e.¢. Wedefine the conlexl of a clause Pi, l_<i<nby the subsequenceS,=P,-,PI...P,_I formed si the set of from all clauses preceding P,. Wenote by W all the words composingSt. Remark.The context of P,, is the emptysequence. Definition 4. = Let S=PoP1...P, be a sequence of..¢ and Pi w~wi2...w~,, a clause of P. 0_<i<n. Let the fimction is recoverable be defined by: 1 A clause P,.l~i£n, is called elliptical ifis recm’erahh~ (P,.S). Therefore.a cl;msc Pi from a sequcnccS is elliptical whenit isnt well-formed and whenit can bc complelcd with wordsof its contextin order to be a well-formedone. Remarks." - In the case of a semantic ambiguity the ellipsis resolution will be treated in a semanticlevel. - Examplesas lie is going roll and it raining arc Irealed like elliptical clauses. Elliptical sentence. Let C be a subset of V* thal we call set of connectors containing the emptyword~: too. Definition 5. A sentence Ph is a seqnence of clauses separated with connectors of C. Ph has the following form: LPh= P)clP~i..c.P. whereci ¯ C. P,E p./ J Definition 6. A sentence Ph is called elliptical if 1) the first clause of Ph is weU-formed 2) at least one of the remainingclauses is elliptical. Example1. Let the sentence Ph be: ¯ E ~. ~..,jl ,JL_~[~,j ~..] jI ~. ~, The blue cable mustbe connectedto terminalA-andeither the yellowcable [mustbe connected] to terminalB or the black cable [must be connected] to terminalC. Po:Timbluecablemustbe connected to terminalA. P1:theyellowcableto terminalB. P2:the blackcableto terminalC. Ph is elliptical becausePo is well-formed andP1 andP2 are elliptical. Outline of the detection method Basing on this formalism, we propose an algorithm for elliptical clause location in a sentenceconsistingof three mainstages performedby the clause parser: -Search of connectors: Wesearch for all existent connectors in the analysed sentence relying on a connector lexicon. The obtained result is madeup of a list containingclauses and of an otheronecontainingconnectors. -Identification of the well-formedclause: Wemakea filtering operation of the already obtained lists of clauses and of connectors based on the identification of the longest well-formedclause of them. Oncethe well-formedclause is identified, the two lists are updated by removing the well-formed clause componentsfrom them. -Labellingof the remainingclauses: Weattribute to each clause of the remainingclause list an etiquettemeaning that the clauseis either elliptical or not. Labellingallowus to localise the elliptical clauses of the list. Therefore, in order to havean elliptical sentence, it mustbegin with a well-formedclause and it mustcontain at least oneclauselabelledbythe functionetiquette. At the end of those three stages, wedispose of enough informationto classify elliptical sentencesand later to establish resolution algorithms(HaddarandBenI-Iamadou 1997). Classification of elliptical sentences Wecan distinguishdifferent classes of elliptical sentences by taking into accountthe followingconsiderations: -occurrence of elliptical clauses that a sentence contains, - alternationof well-formed clause/elliptical clause, - ellipsis type. Formalism Wedistinguishthe followingclass definitions: Definition 7. Anelliptical sentencePI~=P0czPl...c,P~, n~l, is called homogeneous if all clauses Pb l<i<n, are elliptical andhavethe sametypeof ellipsis. Example2. Take the elliptical sentence of example1. Since Pz and P2 havethe sametype of ellipsis then Ph is homogeneous. Definition 8. Anelliptical sentencePl~=PoclPl...c., Pn, n>l, is called heterogeneousif every two successive elliptical clausesPi et P~I, l<i<n, havenot the sametype of ellipsis. Important remark. An elliptical sentence that does not containtwo successivcelliptical clausesis D’stematically heterogeneous whateverthe type of ellipsis of clausesthat the sentenceholds. Example3. Let the sentencePh be: The blue cable must be connectedto term[~al A and the yellow cable [must be connected] to terminal B or youmustkeepthemfree. Po: Theblue cable mustbe connectedto terminal A. PI: the yellowcableto terminalB. P2: you mustkeepthemfree. Phis heterogeneous because PoandP2are well-formed andP1is elliptical. Definition9. Lot Phi=PoclPl...c~P,,n>l be anelliptical sentence. PI~is calledmixedif it is neither homogeneous nor heterogeneous. Example 4. Let the elliptical sentencePh be: The blue cable must be connected to terminal A and neither the yellowcable to terminalB oz.r the blackcable to terminalC o_£youmustkeep themfree. Po: Theblue cable mustbe connectedto terminal A. P~:the yellowcableto terminalB. P2:the balckcable to terminalC. P< you mustkeepthemfree. Ph is mixedbccanseP0 and P~ are well-formed,and Pt andP2are elliptical. All these definitions, whichwehave presented, will aUowus to establish criteria to apply for the membership class determinationof the elliptical sentences with the intention to performthe resolutionprocess. Class determination Accordingto definitions 7, 8 and9, in order to determine the class of an elliptical sentence, wehaveto knowthe formsof ellipses of the successiveelliptical clauses. Let’s recall that an elliptical sentence is systematically heterogeneous whenit does not contain successive elliptical clauses. Then, we do not need to knowthe NaturalLanguage Processing 273 ellipsis forms of elliptical clauses of this sentence. The ellipsis form determination of an elliptical clause is not usually easy to implement. That’s why, we propose a heuristic, which is clearly easier to implement. This heuristic is based on the behaviour of connectors: when someconnectors of the Arabic language are successive in a sentence, the clauses whichfollow them usually have the sameform of ellipsis (ff the), are elliptical). To represent this heuristic, we propose to use a matrix M[k,kl on the set {0,1 } where k = card(C), it is called " the Coherence Matrix of Clause Connectors"and it is defined as follows: Let ci, c.i e C. M[c~,cj]=Imeansthat the clause, whichfollows c. if it is elliptical, contains the same b~eas the clause which follows cj in a given sentence where these two clauses are successive. M[ci,c~l = 0 otherwise. This enables us to define a function that divides the set of connectors of a given elliptical sentence in partitions. The number of this partitions determines the membership class of the elliptical seatcnce. If the numberof partitions is cqual to 1 then Phc is homogeneous.If the numberof partitions is equal to the number of clauses of the remaining clause list then Phc is heterogeneous. Otherwise, Ph, is mixed. Conclusion In this study, we have proposed a formal charactcrisation of ellipsis for the Arabic language and a localisation process of elliptical parts of a sentence. Moreover,we have established classification criteria to define appropriate algorithms for ellipsis resolution. The clause parser, the classification algorithm and the detection algorithm arc implemented with C++ within the framework of CORTEXA sy~em (Ben Hamadou 1993). Finally. our future works concern the improvementof the used clause grammar, the resolution of some ambiguities in particular anaphoracases in ellipsis. References Belguith, L., Ben Hamadou,A. 1’)’)6. Marquagemorphosyntaxique robuste dc roots ambigus 6crit en Arabe nonvoycll6. Forumde recherche cn Informatique 96. Tunis. Ben Hamadou, A. 1993. V6rification ct correction automatique par analyse affixale des tcxtes 6crits en langage naturel: le cas de l’arabe non voyell6. Th&c d’Etat en infornmtique. Facult6 des Sciences de Tunis. Boguraev, B. K. 1983. Recognizing conjunctions without the ATN framework. Automatic Natural Language Parsing. Ellis Horwood. Dalr)anple, M.; Shieber, M. S. and Pereira, F. 1991. Ellipsis and higher-order unification. Linguistic and Philosophy 14: 399-452. Gardent, C. 1992. A Multi-Level Approachto Gapping. In 274 Haddar Pmcccdingsof the Stuttgart Ellipsis WorkshopBericht Nr. 29. Stuttgart: Workshop. Haddar, K. and Ben Hamadou, A. 1997. Formal Description of Ellipses in Arabic Language and Resolution Process. In Proceedings of IEEE ICIPS’97, 1775-1779. Pekin: IEEE ICIPS. Hardt, D. 1993. Verb phrase ellipsis: form, meaning, and processing. A dissertation in computer and information science, University of Pcnnsylwania. Huang, X. 1984. Dealing with conjunctions in Machine Translation Environment. In Proceedings of 10th International Conference on Computational Linguistics. Standford University, Palo Alto, California: International Conference on Computational Linguistics. Jensen, K.; Heidorn, G. E. and Richardson, S.D. eds. 1993 Naturel Language Processing: The PLNLP Approach. Kulwer academic publishers. Lappin, S. 1992. The Syntactic Basis of VP Ellipsis Resolution. In Proceedings of the Stuttgart Ellipsis WorkshopBericht Nr. 29. Stutlgart: Workshop. Lappin, S. 1995. Computational Approaches to Ellipsis Resolution. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics. Ireland: Conferenceof the European Chapter of the Association for ComputationalLinguistics. Mast. M. ctal. 1994. A speech understanding and dialog system with a homogeneouslinguistic kamwledgcbase. Ih’EE Transaclions on pattern anal.vsis and machine intelligence, vol. 16 no. 2, 179-193. Rau, F. L. 1985. The understanding and generation of ellipses in a natural language system. Technical Report, Cs csd-85-227(dir), 30 pages, University of California Berkeley,. Sabah, G. eds. 1986. L’intelligence art!/icielle et le hmgage. Reading, Mass.:Hermes. Weischedel, R. M. and Sondheimcr N. K. 1982. An improvedheuristic for cllipsis processing. In Proceedings of the 20th Annual Meeting of the ACL,85-88. Woods, W. 197(I. Transition networks grammars for natural language analysis, (I;,iCM 13(10): 591-606.