Question Answering and Rhetoric Analysis of Biomedical Texts in the AROMA System

JOHN KONTOS, IOANNA MALAGARDI and JOHN PEROS

Abstract--Question answering with intelligent knowledge management of biomedical texts and analysis of rhetoric relations in the AROMA system is presented. The development of the AROMA system aims at the creation of an intelligent tool for the support of the discovery and adaptation of biomedical models based on data extracted from natural language texts. The system operation includes three main functions, namely question answering, text mining and simulation. The question answering function generates model based answers and their explanations. The operation of AROMA allows the exploitation of rhetoric relations between a “basic” text that proposes a model of a biomedical system and parts of the abstracts of papers that present experimental findings supporting the model. An important use of AROMA concerns the comparison of experimental data with the model proposed in the basic text. The AROMA system consists of three subsystems. The first subsystem extracts knowledge including rhetoric relations from biomedical texts. The second subsystem answers questions with causal knowledge extracted by the first subsystem and generates explanations using rhetoric relation knowledge in addition to other knowledge. The third subsystem simulates the time-dependent behavior of a model from which textual descriptions of the waveforms are generated automatically.

Index Terms--rhetoric relations, intelligent biomedical text mining, knowledge discovery from text, simulation, question answering, explanation, p53, mdm2.

I. INTRODUCTION

The new AROMA (Automatic Rhetoric Organizer for Model Analysis) system is presented in the present paper and an illustrative example of application is described. This system is an intelligent computer tool for question answering and text mining [5] of biomedical knowledge including the management of rhetoric knowledge.
Parts of the older version of the system were presented in [6], [7], [8], [9], [10].

ARTIFICIAL INTELLIGENCE GROUP, DEPARTMENT OF PHILOSOPHY AND HISTORY OF SCIENCE, NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS. E-mail: ikontos2003@yahoo.com, imal@gsrt.gr

The development of the AROMA system aims at the creation of an intelligent tool for the support of the discovery and adaptation of biomedical models based on data extracted from natural language texts. The system operation includes two main functions, namely question answering and text mining. The question answering function generates model based explanations of the answers. The expanded operation of AROMA allows the exploitation of rhetoric relations between a basic text that proposes a model of a biomedical system and parts of the text of papers that present experimental data supporting the model.

In a typical application of the system a theoretical text presenting the model of a system under scientific investigation is taken as the “basic” one, and a computerized method is used for the analysis of the relationship of the model to the texts of the papers that provide experimental support of the model or theory proposed by it. Text mining is applied both to the basic paper and to the supporting papers in order to extract the knowledge fragments that have to be rhetorically related and computationally processed.

An important aspect of the application of the system is the answering of natural language questions for the computer aided comparison of new experimental data with the predictions of the model proposed in the “basic” paper. The friendly man-machine interface of the system is capable of answering such questions and generating explanations that help the user in tracing the support of the model by experimental and background domain knowledge.
It is envisaged that the AROMA System may prove useful for the support of the discovery activity of research scientists with the intelligent management of scientific knowledge mined from texts.

Text mining differs from data mining [2] in that it uses unstructured texts for the collection of knowledge rather than structured sources like databases. We define intelligent text mining as the process of creating novel knowledge from texts that is not stated in them by combining sentences with deductive reasoning. An early implementation of this kind of intelligent text mining was reported by us in [4]. A review of different kinds of text mining including intelligent text mining is presented in [11].

There are two possible methodologies of applying deductive reasoning to texts. The first methodology is based on the translation of the texts into some formal representation of their “content” with which deduction is performed [12]. The advantage of this methodology lies in the simplicity and availability of the required inference engine, but its serious disadvantage is the need for reprocessing all the texts and storing their translation into a formal representation every time something changes in their domain. In the case of scientific texts what may change is some part of the background knowledge such as the ontology used for deducing new knowledge. The second methodology eliminates the need for translation of the texts into a formal representation because an inference engine capable of performing deductions “on the fly”, i.e. directly from the original texts, is implemented.

A disadvantage of this second methodology is that a more complex inference engine than the one needed by the first one must be built. The implementation of such inference engines has however proved feasible for causal knowledge. The strong advantage of the second methodology is that the translation into a formal representation is avoided. In [4] we proposed the second methodology and the method we developed was therefore called ARISTA, i.e. Automatic Representation Independent Syllogistic Text Analysis. The ARISTA method is used in the AROMA system for its basic reasoning functions.

An early attempt was also made by us in [4] to implement a model-based question answering system using the scientific text as a knowledge base describing a qualitative model. One of our early application examples concerned medical text mining related to human respiratory system mechanics [4].

Biomedical text mining is now recognized as a very important field of study [3], [17], particularly for molecular biology. The system GeneWays [16] is a typical example of a biomedical text analysis system for the extraction of molecular pathway data. The idea of combining text mining with simulation and question answering was pursued further by our group as reported in [7], [8], [9] and [10] as well as in the present paper.

The general architecture of our AROMA system consists of three subsystems, namely the Knowledge Extraction Subsystem, the Causal Reasoning Subsystem and the Simulation Subsystem. These subsystems are briefly described below.

II. THE GENERAL ARCHITECTURE OF THE AROMA SYSTEM

A. The Knowledge Extraction Subsystem

This subsystem integrates partial causal knowledge extracted from a number of different texts. This knowledge is expressed in natural language using causal verbs such as “regulate”, “enhance” and “inhibit”. These verbs usually take as arguments entities such as protein names and gene names that occur in the biomedical texts that we use for the present applications. In this way causal relations between entities and processes are expressed. A lexicon containing words such as causal verbs and stop words is used by this subsystem. An output file is produced by the system that contains parts of sentences collected from the original sentences of different input texts. This output file is used for reasoning by the second subsystem. The input files used for this subsystem in the example of p53-mdm2 dynamics contain texts downloaded from MEDLINE. The operation of the subsystem is based on the recognition of noun phrases and verb groups and their relations.

B. The Causal Reasoning Subsystem

The output of the first subsystem is used as input to the second subsystem, which combines background knowledge with causal knowledge in natural language form to produce by automatic deduction conclusions not mentioned explicitly in the input text. The operation of this subsystem is based on the ARISTA method [4] and results in the on-the-fly recognition of causal relations of the form “causes(process1, entity1, process2, entity2, manner)”. The pair (process1, entity1) stands for the cause, the pair (process2, entity2) stands for the effect and “manner” stands for the kind of causality, i.e. whether it is positive or negative.

The sentence fragments containing causal knowledge are analyzed and the entity-process pairs are recognized. The user questions are analyzed and reasoning goals are extracted from them. The answers to the user questions are generated automatically by a reasoning process together with explanations in natural language form. This is accomplished by the chaining “on the fly” of causal statements using background knowledge such as an ontology to support the reasoning process.

C. The Simulation Subsystem

The third subsystem is used for modelling the dynamics of the biomedical system specified in the “basic” text. The characteristics of the model such as structure and parameter values are extracted from the input texts combined with background knowledge such as ontology and default process and entity knowledge [9]. Considering the p53-mdm2 example, two coupled first order differential equations were used as the approximate mathematical model of the biomedical system in rough correspondence with the model proposed in [1].
A basic characteristic of the behaviour of such a system is the occurrence of oscillations for certain values of the parameters of the equations. Two finite difference equations that approximate the differential equation system of the model are:

Δx = a1*x + b1*y + c1*x*y    (1)
Δy = a2*y + b2*delay(d,x)    (2)

where Δx means the difference between the value of the variable x at the present time and the value of the variable x at the next time instant, and delay(d,x) in equation (2) stands for the value that x had d units of time before the present time. Time is taken to advance in discrete steps. The symbols x and y are the variables that represent the concentrations of the proteins p53 and mdm2 respectively. The symbols a1, b1, c1, a2, b2 are the coefficients of the equations that represent the parameters of the biomedical system. It is noted that the multiplicative term c1*x*y renders equation (1) non-linear. This non-linearity causes the appearance of the oscillations to differ from simple sine waves.

The solution of these equations is accomplished with a Prolog program that provides an interface for manipulating the parameters of the model. An important module of the simulation subsystem is one that generates text describing the behaviour of the variables of the model over time. The above information concerning the connection of the parts of the differential equations with the parts of the biomedical system being modelled is formalized using rhetorical relations explained below.

III. THE RHETORIC RELATIONS

Research on the rhetoric or discourse analysis of biomedical texts has started only recently [15]. In [15] an annotation scheme is proposed for a rhetorical analysis of biology articles using the “zoning” method. This method characterizes parts of a scientific text using an annotation scheme to identify these parts or “zones”. In our system we apply the rhetoric relations approach [13, 14] in contrast with the zoning approach. This approach connects parts of sentences and other textual fragments using rhetoric relations. About 50 rhetoric relations have been theoretically defined in [13] but only 4 were computationally defined in [14], namely “contrast”, “cause-explanation-evidence”, “elaboration” and “condition”, which were defined at a much coarser level of granularity for practical reasons. It is however proposed to enrich this approach of analysis by using a few more rhetoric relations necessary for the representation of the content of scientific texts related to models of systems.

We distinguish between two kinds of purely textual rhetoric relations, namely internal to the “basic” paper (symbolized as “inr”) and external to it (symbolized as “exr”). A third kind of relations (symbolized as “mbr”) is proposed that formalizes the appearance of the waveforms of the time behaviour of the model variables. These waveforms are treated as “numerical narratives” with a “rhetoric” structure of events equivalent to the “pattern structure” of the waveforms, where peaks and valleys or maxima and minima play the role of events. All these kinds of relations are briefly described below and illustrated using examples from [1], where “coerel” and “varrel” are of the inr kind, “parbib” and “entfra” are of the exr kind and “behrel” and “tfollows” are of the mbr kind.

The formal representation of the rhetoric relations used by our system consists of a single predicate rr(PAR_1,…,PAR_n) where the arguments PAR_1 to PAR_n stand for n parameters that define a rhetoric relation between two rhetorically related “objects”. These objects may be of different kinds such as sentences or other text fragments, equations, equation variables, physical quantities, citations, references or parts of waveforms representing the time behavior of model variables. The formal representation:

rr(relation_name, relation_kind, first_object_kind, first_object_identifier, second_object_kind, second_object_identifier).

is used below for the illustration of some rhetorical relations used by our system.

a. Internal Relations (inr)

1) The relation name “coerel” stands for relations between a coefficient “c” of a mathematical model and the corresponding entity property or parameter “ep” of the biomedical system modeled. Or formally: rr(coerel,inr,coefficient,c,parameter,ep).

2) The relation name “varrel” stands for relations between a variable “x” of an equation of the model and a physical entity “e” or an entity property “ep” of the biomedical system modeled. Or formally: rr(varrel,inr,variable,x,entity,e) and rr(varrel,inr,variable,x,entityp,ep).

b. External Relations (exr)

3) The relation name “parbib” stands for relations between a parameter “p” of the biomedical system and a bibliographic reference “r” to a paper that contains experimental data that support the inclusion and possibly the numerical value of this parameter. Or formally: rr(parbib,exr,parameter,p,reference,r).

4) The relation name “entfra” stands for relations between an entity “e” of a biomedical system and a text fragment labeled “f” of a reference text labeled “r”. The fragment is symbolized by “r_f” and presents the experimental data that support the inclusion and possibly the numerical values of the properties of the entity “e”. Or formally: rr(entfra,exr,entity,e,fragment,r_f).

c. Model Behavior Relations (mbr)

5) The relation name “behrel” stands for relations between some coefficient “c” of the model and the time behavior of a variable “v” of the model. Or formally: rr(behrel,mbr,coefficient,c,variable,v).
6) The relation name “tfollows” stands for time ordering relations between a part in position “wp1” of a waveform representing the time variation of a variable v1 of the model, such as a peak or a valley, symbolized as wp1_v1, and a part in position “wp2” of a waveform representing the time variation of variable v2, symbolized as wp2_v2. The kind of object “waveform part” is symbolized as “wpart”. Or formally: rr(tfollows,mbr,wpart,wp1_v1,wpart,wp2_v2).

IV. SOME RHETORIC RELATIONS IN THE P53-MDM2 EXAMPLE

In the p53-mdm2 example the rhetoric relations extracted concern the text [1] that proposes a mathematical model for the interaction of the proteins p53 and mdm2, as well as the MEDLINE abstracts of papers related to the model proposed by [1], whether or not they are used as references by [1]. These abstracts were downloaded from MEDLINE and contain knowledge of experimental results concerning the interaction of the proteins p53 and mdm2. These proteins are involved in the life cycle of the cell and interact through a negative feedback system.

Some rhetoric relations found in the text [1] are:

r1: rr(coerel,inr,coefficient,sourcep53,parameter,synthesis_rate_of_the_p53_protein)
extracted from the sentence: “Here the coefficient sourcep53 specifies the synthesis rate of the p53 protein.”

r2: rr(coerel,inr,coefficient,activity,parameter,p53's_sequence-specific_DNA_binding_activity)
extracted from the sentence: “The coefficient activity can include p53's sequence-specific DNA binding activity”

r3: rr(varrel,inr,variable,degradation(t),entity,rate_of_degradation)
extracted from the sentence: “The variable degradation(t) measures the rate of degradation”

r4: rr(coerel,inr,coefficient,p1,entity,rate_of_p53-independent_mdm2_transcription) and
r5: rr(parbib,exr,entity,mdm2,reference,26)
extracted from the sentence: “Here the coefficient p1 denotes the rate of p53-independent mdm2 transcription and translation (24)”

The abstracts of two papers presenting experimental data supporting the qualitative model proposed by [1], named with the labels “32” and “92”, can be used to illustrate the question answering process of the AROMA system and the use of external rhetoric relations in the explanations generated by the question answering process. The first abstract, labeled “32” as listed in [1], consists of six sentences, of which two are selected by the first subsystem of AROMA and from which the following two sentence fragments are extracted automatically:

1. “The p53 protein regulates the mdm2 gene”
2. “regulates both the activity of the p53 protein”

These fragments are then automatically transformed to Prolog facts in order to be processed by the second subsystem as shown below:

t("32_5", "The p53 protein regulates the mdm2 gene").
t("32_6", "regulates both the activity of the p53 protein").

The labels 32_5 and 32_6 denote that these fragments are extracted from the sentences 5 and 6 of the text with label 32.

The second abstract, labeled “92”, which was found independently of [1], consists of seven sentences, of which two are selected by the first subsystem of AROMA and from which the following two sentence fragments are extracted automatically:

3. “The mdm2 gene enhances the tumorigenic potential of cells”
4. “The mdm2 oncogene can inhibit p53_mediated transactivation”

and expressed in the form of Prolog facts as:

t("92_3", "The mdm2 gene enhances the tumorigenic potential of cells").
t("92_7", "The mdm2 oncogene can inhibit p53_mediated transactivation").

The labels 92_3 and 92_7 denote that these fragments are extracted from the sentences 3 and 7 of the text with label 92.
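As a rough illustration of how sentence fragments such as those above could be turned into signed causal links and chained into a loop, the following Python sketch may help. It is not the authors' Prolog implementation of ARISTA. The verb polarity table VERB_SIGN, the entity list ENTITIES and the helper names find_entity, extract_link and find_loops are all assumptions made for this example, and the recognition of noun phrases is reduced to a substring search.

```python
# Assumed polarity lexicon: the sign given to each causal verb is an
# illustrative guess, not taken from the AROMA lexicon.
VERB_SIGN = {"regulates": +1, "enhances": +1, "inhibit": -1, "inhibits": -1}

# Assumed list of entity names to look for inside noun phrases.
ENTITIES = ("p53", "mdm2")

# Sentence fragments keyed by their labels, as in the t(Label, Text) facts.
FRAGMENTS = {
    "32_5": "The p53 protein regulates the mdm2 gene",
    "92_3": "The mdm2 gene enhances the tumorigenic potential of cells",
    "92_7": "The mdm2 oncogene can inhibit p53_mediated transactivation",
}


def find_entity(phrase):
    """Return the known entity that appears earliest in the phrase, or None."""
    lowered = phrase.lower()
    hits = [(lowered.find(name), name) for name in ENTITIES if name in lowered]
    return min(hits)[1] if hits else None


def extract_link(label, text):
    """Split a fragment at its causal verb into (cause, effect, sign, label)."""
    words = text.split()
    for i, word in enumerate(words):
        sign = VERB_SIGN.get(word.lower())
        if sign is None:
            continue
        cause = find_entity(" ".join(words[:i]))
        effect = find_entity(" ".join(words[i + 1:]))
        if cause and effect:
            return (cause, effect, sign, label)
    return None  # no causal verb, or one side names no known entity


def find_loops(links, start):
    """Chain links depth-first; return (labels, overall sign) of closed loops."""
    loops = []

    def walk(node, path, sign):
        for cause, effect, link_sign, label in links:
            if cause != node or label in path:
                continue
            if effect == start:
                loops.append((path + [label], sign * link_sign))
            else:
                walk(effect, path + [label], sign * link_sign)

    walk(start, [], +1)
    return loops


candidates = (extract_link(label, text) for label, text in FRAGMENTS.items())
links = [link for link in candidates if link is not None]
for path, sign in find_loops(links, "p53"):
    kind = "negative" if sign < 0 else "positive"
    print(" -> ".join(path), "closes a", kind, "feedback loop on p53")
```

With the three fragments above, the sketch keeps the links from 32_5 and 92_7, drops 92_3 because its effect names no known entity, and reports one loop through 32_5 and 92_7 whose overall sign is negative.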
Using the above sentence fragments of the p53-mdm2 example our system discovers the causal negative feedback loop by appropriate chaining of the relevant sentence fragments and represents it as:

p53 +causes mdm2 -causes p53

where +causes means “causes increase” and -causes means “causes decrease or inhibition”. This is effected by answering the question:

Is there a process loop of p53?

This question is internally represented as the Prolog goal “cause(P1,p53,P2,p53,S)”, where P1 and P2 are two process names that the system extracts from the texts and that characterize the behavior of the entity p53. S stands for the overall effect of the feedback loop found, i.e. whether it is a positive or a negative feedback loop. In this case S is found equal to “-” since a positive causal connection is followed by a negative one.

The short answer automatically generated by our system is:

Yes. The loop is p53 activity -causes p53 production.

By a short answer we mean a simple answer not connected to any explanation of the reasoning followed for the derivation of the answer. The long answer automatically generated by our system together with an appropriate explanation is as follows:

The QUESTION is: “Is there a process loop of p53 ?”
Represented internally in Prolog as: cause(P1,p53,P2,p53,S).
USING INFERENCE RULE IR4a
since the DEFAULT entity of <p53_mediated> is <p53>
with rhetoric relations:
rr(entfra,exr,entity,p53,fragment,92_7)
rr(entfra,exr,entity,mdm2,fragment,92_7)
USING INFERENCE RULE IR4b
with rhetoric relations:
rr(entfra,exr,entity,p53,fragment,32_5)
rr(entfra,exr,entity,mdm2,fragment,32_5)
the EXPLANATION is:
since <oncogene> is a kind of <gene>
p53 protein -causes p53
because
p53 protein +causes gene of mdm2
and
oncogene of mdm2 -causes p53 mediated transactivation of p53.

It should be noted that the combination of sentence fragments (92_7) and (32_5) in a causal chain that forms a closed negative feedback loop is based on two facts of background knowledge. This background knowledge is inserted manually in our system as Prolog facts and can be stated as:

the DEFAULT entity of <p53_mediated> is <p53>, or default(p53_mediated, p53).
<oncogene> is a kind of <gene>, or kind_of(oncogene, gene).

The above analysis of the text fragments of the example is partially based on the following background linguistic and domain knowledge, which is manually inserted as Prolog facts:

Linguistic Knowledge:
kind_of("the","determiner").
kind_of("is","copula").
kind_of("of","preposition").
kind_of("activated","causal_connector").
kind_of("inhibits","causal_connector").
kind_of("regulated","causal_connector").

Domain Knowledge:
kind_of("protein","entity").
kind_of("DNA","entity").
kind_of("p53","entity").
kind_of("Mdm2","entity").
kind_of("damage","process").
kind_of("expression","process").
kind_of("increase","process").
kind_of("activity","process").

The above knowledge base fragment contains both linguistic and domain knowledge to support the analysis of the sentences occurring in the corpus. In practice these two parts of knowledge are handled differently by the inference rules of the reasoning module.

V. AUTOMATIC DESCRIPTION OF THE BEHAVIOR OF THE MODEL VARIABLES

The system can automatically generate descriptions of the time behaviour of the concentrations of the two proteins, which are represented by the two variables of the model, using the “tfollows” rhetoric relation. More details may be found in [9] and [10].
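A minimal Python sketch of how equations (1) and (2) of Section II.C could be iterated, and how a description of this kind could be assembled from the peaks and valleys of the resulting series, is given here. The AROMA simulator itself is written in Prolog, so this is only an illustration under stated assumptions: Δ is read as the next value minus the current one, the delayed value of x is taken to be the initial value until enough history exists, time is counted in whole steps, and the coefficient values in the final call are placeholders that are not taken from [1]. The function names simulate, extrema and describe are hypothetical.

```python
def simulate(a1, b1, c1, a2, b2, d, x0, y0, steps):
    """Iterate equations (1) and (2), reading delta as next minus current."""
    xs, ys = [x0], [y0]
    for t in range(steps):
        x, y = xs[t], ys[t]
        # delay(d, x): value of x d steps earlier. Before that much history
        # exists, the initial value x0 is used (an assumption of this sketch).
        x_delayed = xs[t - d] if t >= d else x0
        xs.append(x + a1 * x + b1 * y + c1 * x * y)   # equation (1)
        ys.append(y + a2 * y + b2 * x_delayed)        # equation (2)
    return xs, ys


def extrema(name, values):
    """Return (time, kind, name, value) for strict local peaks and valleys."""
    events = []
    for t in range(1, len(values) - 1):
        prev, cur, nxt = values[t - 1], values[t], values[t + 1]
        if cur > prev and cur > nxt:
            events.append((t, "peak", name, cur))
        elif cur < prev and cur < nxt:
            events.append((t, "valley", name, cur))
    return events


def describe(series):
    """Order the events of all variables in time and link them in one sentence."""
    events = sorted(
        event for name, values in series.items() for event in extrema(name, values)
    )
    parts = [
        f"a {kind} of {name} at T={t} {name}={value:g}"
        for t, kind, name, value in events
    ]
    if not parts:
        return ""
    text = parts[0]
    if len(parts) > 1:
        text += " is followed by " + " which is followed by ".join(parts[1:])
    return text[0].upper() + text[1:]


# Hypothetical coefficients chosen only to exercise the code, not taken from [1].
xs, ys = simulate(a1=0.1, b1=0.0, c1=-0.02, a2=-0.2, b2=0.1, d=3,
                  x0=1.0, y0=0.5, steps=60)
print(describe({"P53": xs, "MDM2": ys}) or "no peaks or valleys found")
```

Each consecutive pair of events in the generated sentence corresponds to one time ordering of the tfollows kind. Sorting all events by time before joining them is a design choice of this sketch, so that peaks and valleys of the two variables can interleave in the text.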
An example is shown below of an automatically produced description of the numerical results produced by the solution of the model equations, where T stands for time and the protein names are capitalized as P53 and MDM2 for emphasis only:

“A peak of P53 at T=3.6 P53=28830 is followed by a peak of MDM2 at T=6.4 MDM2=16550 which is followed by a valley of P53 at T=8.6 P53=-6100 which is followed by a valley of MDM2 at T=14.6 MDM2=9360 which is followed by a peak of P53 at T=17 P53=670”.

VI. CONCLUSIONS

In the present paper we presented the question answering function of our AROMA system with intelligent knowledge management and rhetoric analysis of biomedical texts related to the modeling of a biomedical system. The AROMA system that we have developed consists of three main subsystems. The first subsystem achieves the extraction of knowledge from texts that is related to the structure and the parameters of the biomedical system simulated. The second subsystem is based on a reasoning process that answers questions by combining causal knowledge extracted by the first subsystem with background knowledge and generates explanations of the reasoning followed. The third subsystem is a system simulator written in Prolog that generates the time behavior of the model’s variables.

An important feature of the system presented here is its ability to perform model based non-factoid question answering with the use of rhetoric relation recognition and intelligent causal knowledge extraction from scientific texts, together with explanation generation and automatic generation of textual descriptions of the dynamic behavior of the model of a biomedical system.

References

[1] Bar-Or, R. L. et al (2000). Generation of oscillations by the p53-Mdm2 feedback loop: A theoretical and experimental study. PNAS, vol. 97, No 21, pp. 11250-11255.
[2] Barrera, J. et al (2004). An environment for knowledge discovery in biology. Computers in Biology and Medicine, vol. 34, pp. 427-447.
[3] Hirschman, L. et al (2002). Accomplishments and challenges in literature data mining for biology. Bioinformatics, vol. 18, No 12, pp. 1553-1561.
[4] Kontos, J. (1992). ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology, vol. 34, No 9, pp. 611-616.
[5] Kontos, J. and Malagardi, I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems, vol. 26, No 2, pp. 103-122.
[6] Kontos, J. and Malagardi, I. (2001). A Search Algorithm for Knowledge Acquisition from Texts. Proceedings of HERCMA 2001, 5th Hellenic European Research on Computer Mathematics & its Applications Conference, Athens.
[7] Kontos, J. et al (2002a). ARISTA Causal Knowledge Discovery from Texts. Discovery Science 2002, Luebeck, Germany. Proceedings of the 5th International Conference DS 2002, Springer Verlag, pp. 348-355.
[8] Kontos, J. et al (2002b). System Modeling by Computer using Biomedical Texts. Res Systemica, volume 2, Special Issue, October 2002. (http://www.afscet.asso.fr/resSystemica/accueil.html).
[9] Kontos, J. et al (2003a). The Simulation Subsystem of the AROMA System. HERCMA 2003, Proceedings of the 6th Hellenic European Research on Computer Mathematics & its Applications Conference, Athens.
[10] Kontos, J. et al (2003b). The AROMA System for Intelligent Text Mining. HERMIS, vol. 4, pp. 163-173.
[11] Kroetze, J. A. et al (2003). Differentiating Data- and Text-Mining Terminology. Proceedings of SAICSIT 2003, pp. 93-101.
[12] Kuehne, S. E. and Forbus, K. D. (2004). Capturing QP-relevant Information from Natural Language Text. Proceedings of the 18th International Workshop on Qualitative Reasoning, August, Evanston, Illinois, USA.
[13] Mann, W. C. and Thompson, S. A. (1988). Rhetorical structure theory: toward a functional theory of text organization. Text, 8(3), pp. 243-281.
[14] Marcu, D. and Echihabi, A. (2002). An unsupervised approach to recognizing discourse relations. ACL 2002.
[15] Mizuta, Y. and Collier, N. (2004). An Annotation Scheme for Rhetorical Analysis of Biology Articles. Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal.
[16] Rzhetsky, A. et al (2004). GeneWays: a system for extracting, analyzing, visualising and integrating molecular pathway data. Journal of Biomedical Informatics, vol. 37, pp. 43-53.
[17] Shatkay, H. and Feldman, R. (2003). Mining the Biomedical Literature in the Genomic Era: An Overview. Journal of Computational Biology, 10(3), pp. 821-855.