GRAMMARS FOR QUESTION ANSWERING SYSTEMS BASED ON INTELLIGENT TEXT MINING IN BIOMEDICINE

JOHN KONTOS, JOHN LEKAKIS, IOANNA MALAGARDI and JOHN PEROS

ARTIFICIAL INTELLIGENCE GROUP
DEPARTMENT OF PHILOSOPHY AND HISTORY OF SCIENCE
NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS
E-mail: ikontos2003@yahoo.com, imal@gsrt.gr

Abstract--A top-down approach to the determination of grammar engineering methods for the construction of Natural Language Grammars for Question Answering Systems based on Intelligent Text Mining for Biomedical applications is presented in this paper. Our proposal is to start from the formulation of the ultimate task goal, such as specialized question answering, and to derive from it the specifications of the grammar engineering methods, such as hand crafting or machine learning. The task goal we have experience in involves syllogistic inference with scientific texts and models in biomedicine. An important kind of knowledge that we deal with is causal knowledge, expressed either linguistically or formally. The linguistic expressions we deal with involve causal verbs that connect noun phrases denoting entity-process pairs. An illustrative example of a Text Grammar is given and a pilot experiment towards Memory Based Induction using TiMBL is described. The results obtained so far render the automatic induction of multilevel grammars problematic.

Index Terms--Question Answering, Intelligent Text Mining, Biomedical Texts, Biomedical Models.

I. INTRODUCTION

A top-down approach to the determination of grammar engineering methods for the construction of Natural Language Grammars for Question Answering Systems based on Intelligent Text Mining for Biomedical applications is presented in this paper. Our proposal is to start from the formulation of the ultimate task goal, such as specialized question answering, and to derive from it the specifications of the grammar engineering methods, such as hand crafting or machine learning. The task goal we have experience in involves syllogistic inference with scientific texts and models in biomedicine. We define Intelligent Text Mining as the process of creating, by combining sentences with deductive reasoning, novel knowledge that is not stated in the texts themselves.

There are two possible methodologies for applying deductive reasoning to texts. The first methodology is based on the translation of the texts into some formal representation of their “content”, with which deduction is then performed. The advantage of this methodology is the simplicity and availability of the required inference engine, but its serious disadvantage is the need to preprocess all the texts and to store their translation into a formal representation anew every time something changes. In the case of scientific texts in particular, what may change is some part of the prerequisite knowledge, such as the ontology used for deducing new knowledge.

The second methodology eliminates the need for translation of the texts into a formal representation, because an inference engine is implemented that performs deductions “on the fly”, i.e. directly from the original texts. The only disadvantage of this second methodology is that a more complex inference engine must be built than the one needed by the first. Its strong advantage is that the translation into a formal representation is avoided altogether. In [4] the second methodology is chosen, and the method developed is therefore called ARISTA, i.e. Automatic Representation Independent Syllogistic Text Analysis.

The ultimate task goal of the applications of grammars that we are interested in is the development of systems for knowledge extraction and processing with intelligent text mining of biomedical texts. An important kind of knowledge that we deal with is causal knowledge, expressed either linguistically or formally.
The linguistic expressions we deal with involve causal verbs that connect noun phrases that denote entity-process pairs. In certain cases it is hoped that the knowledge obtained in this way leads to novel answers to the questions posed to the system, accompanied by appropriate automatically generated explanations [4], [5], [6], [7].

The ultimate goal of a Question Answering system based on Intelligent Text Mining is to extract a meaningful answer from the underlying text sources. Answer extraction depends on the complexity of the question, on the answer type provided by question processing, on the actual texts where the answer is searched for, on the search method, and on the question focus and context. The requested information might be present in a variety of heterogeneous data sources, which must be searched by different retrieval methodologies. Several collections of texts might be organized into separate structures, with different index operators and retrieval techniques. Moreover, additional information may be organized in databases and models, or may simply reside on the Web in HTML or XML format. The context could be represented in different formats according to the application, but the search for the answer must be uniform across all sources, textual or otherwise, and the answer must be identified uniformly. To enable this process, different indexing and retrieval techniques for Q&A must be employed, and methods for detecting overlapping as well as contradictory information must be developed. Moreover, answers may be found in existing knowledge bases or ontologies.

During the answer extraction process itself at least three problems arise. The first is related to the assessment of answer correctness, and implicitly to ways of justifying its correctness. The second relates to the method of knowing whether all the possible correct answers have been found throughout the text sources.
The third problem is that of generating a coherent answer when its information is distributed across a document or throughout different documents. An extension of this problem arises when special reasoning must be performed prior to the extraction of the answer, because the simple juxtaposition of the answer parts is not sufficient and the answer must be derived deductively. To justify the correctness of an answer, explanations should be made available in the most natural format. Explanations may be based on a wide range of information, varying from evidence provided by natural language texts to commonsense and pragmatic knowledge available from on-line knowledge bases. A back-chaining mechanism implemented in a theorem prover may generate the explanation for an answer in the most complex cases and translate it into a form easily and naturally understood. In the case of textual answers, textual evidence is necessary for the final justification of the answer.

II. THE TOTAL QUESTION ANSWERING TASK

The AROMA (ARISTA Oriented Modeling Adaptation) system for Intelligent Text Mining is the implementation of a representative solution of the total Question Answering task. AROMA accomplishes knowledge discovery using our knowledge representation independent method ARISTA, which performs causal reasoning “on the fly”, directly from text, in order to extract qualitative or quantitative model parameters as well as to generate an explanation of the reasoning. The application of the system is demonstrated with two biomedical examples concerning cell apoptosis, compiled from a collection of MEDLINE paper abstracts related to a recent proposal of a relevant mathematical model. A basic characteristic of the behavior of such biological systems is the occurrence of oscillations for certain values of the parameters of their equations. The solution of these equations is accomplished with a Prolog program.
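To indicate the general shape of such a simulation, the sketch below integrates a two-variable negative-feedback system with forward Euler. The equations and all parameter values are illustrative assumptions of ours; they are not the p53-mdm2 model of the cited papers, which is solved by a Prolog program.

```python
# A two-variable negative-feedback sketch of a protein-pair simulation.
# Equations and parameter values are illustrative assumptions only.
def simulate(steps=20000, dt=0.01):
    p53, mdm2 = 1.0, 0.1          # assumed initial concentrations
    trace = []
    for _ in range(steps):
        # p53: constant production, degradation promoted by mdm2
        dp53 = 1.0 - 1.5 * mdm2 * p53 / (0.5 + p53)
        # mdm2: production activated nonlinearly by p53, linear decay
        dmdm2 = 1.2 * p53 ** 4 / (1.0 + p53 ** 4) - 0.6 * mdm2
        p53 += dt * dp53
        mdm2 += dt * dmdm2
        trace.append((p53, mdm2))
    return trace
```

Whether such a trace oscillates, and with what waveform, depends on the parameter values; studying exactly this dependence against experimental data is the role assigned to the modeling subsystem described below.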
This program provides an interface for manipulation of the model by the user and for the generation of textual descriptions of the model behavior. The manipulation is based on the comparison of the experimental data with the graphical and textual simulator output. This is an application to a mathematical model of the p53-mdm2 feedback system. The dependence of the behavior of the model on its parameters may be studied using our system. The variables of the model correspond to the concentrations of the proteins p53 and mdm2 respectively. The non-linearity of the model causes the appearance of oscillations that differ from simple sine waves.

The AROMA system for Intelligent Text Mining consists of three subsystems, briefly described below. The first subsystem achieves the extraction of knowledge from individual sentences, which is similar to traditional information extraction from texts. In order to speed up the whole knowledge acquisition process, a search algorithm is applied to a table of combinations of keywords characterizing the sentences of the text. The second subsystem is based on a causal reasoning process that generates new knowledge by combining “on the fly” the knowledge extracted by the first subsystem. Part of the generated knowledge is used as parametric input to the third subsystem and controls the model adaptation. The third subsystem is based on system modeling and is used to generate time-dependent numerical values that are compared with experimental data. This subsystem is intended for simulating in a semi-quantitative way the dynamics of nonlinear systems. The characteristics of the model, such as the structure and parameter polarities of the equations, are extracted automatically from scientific texts and are combined with prerequisite knowledge such as ontology and default process and entity knowledge.

The AROMA system may answer a variety of questions, some kinds of which are briefly specified in the following illustrative taxonomy (question type and abstract specification):

1. Causal antecedent: What caused some event to occur? What state or event causally led to an event or state?
2. Causal consequence: What are the consequences of an event or state? What causally unfolds from an event or state?
3. Enablement: What object or resource enables an agent to perform an action?
4. Instrumental/Procedural: What instrument or body part is involved when a process is performed?

III. CAUSAL KNOWLEDGE IN NATURAL LANGUAGE TEXTS

An important kind of knowledge that we deal with is causal knowledge expressed either linguistically or formally. The linguistic expressions we deal with involve causal verbs that connect noun phrases that denote entity-process pairs. The texts which we are interested in analyzing are found, when studied, to contain causal knowledge that appears in the form of information delivered both "directly" and "indirectly". Sentences delivering causal knowledge indirectly are in most cases embedded sentences, the main verb being a metascience operator (e.g. discovered, observed, has shown how, found that, demonstrate that, argued that, explain how, etc.). By "causal knowledge", thus, we mean any form of knowledge relating to the interaction of objects/entities and events or processes in some model, i.e. knowledge about how processes related to certain entities result from or lead to processes associated with other or the same entities. This knowledge is mainly expressed in the form of causal relations. In terms of the "state"-"event"(/"process") pair, the types of causal links dealt with in the example text are described as events/processes causing events/processes, or as conditions (states or events) allowing some causal link. In natural language, "causality" may be expressed in a variety of linguistic forms. The linguistic forms expressing the parts of causal relations involve, among others, clausal adverbial subordination, clausal adjectival subordination, and discourse connection.
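Once causal relations in such linguistic forms are recognized, they can be chained deductively without ever storing a formal representation, in the spirit of the representation-independent ARISTA method discussed in the introduction. The following is a minimal sketch under strong assumptions: a single surface pattern and two invented sentences, not the actual implementation.

```python
# Minimal sketch of deduction "on the fly": causal chains are derived by
# re-reading the raw sentences at query time, so no formal representation
# of the text is ever stored. Pattern and sentences are invented examples.
import re

SENTENCES = [
    "hypoxia causes activation of NF-kappa-B",
    "activation of NF-kappa-B causes transcription of target genes",
]

PATTERN = re.compile(r"^(.*?) (?:causes|induces) (.*)$")

def causes(antecedent, consequent, depth=3):
    """True if the text supports antecedent ->* consequent by chaining."""
    if depth == 0:
        return False
    for sentence in SENTENCES:
        m = PATTERN.match(sentence)
        if m and m.group(1) == antecedent:
            if m.group(2) == consequent or causes(m.group(2), consequent, depth - 1):
                return True
    return False
```

Here `causes("hypoxia", "transcription of target genes")` succeeds only by syllogistically combining the two sentences, and recording the chain of matched sentences would provide the raw material for a generated explanation.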
A causal relation, typically, is a pair consisting of an "antecedent" and a "consequent". Antecedent and consequent may appear in two connected sentences or in terms of some causal implication. In passivised sentences the by-phrase introduces the antecedent of the causal relation. Antecedents of causal relations may also be identified in by-phrases associated with past participles. Some verbs and participles may function as mere causal connectives (e.g. cause), in contrast to others (e.g. increase) which may carry some additional meaning. Processes which function as antecedents or consequents of some causal relation can also be referred to by nominalizations in causal-relation-delivering sentences. In dealing with sentential constituents, rather than complete sentences, which may deliver one of the parts of some causal relation, we should note that adverbial modification may also provide material qualifying for such antecedents/consequents.

Simple causal relations acquired from texts may be recognized by means of a four-argument predicate with the arguments:
(a) the consequent process related to some entity in the physical system,
(b) the entity in the physical system involved in the consequent,
(c) the antecedent process related to a similar (or the same) entity,
(d) the entity involved in the antecedent.

The automatic extraction of the appropriate terms which will fill the process and entity slots of the "cause" predicate involves the following steps. First, phrases qualifying for antecedents/consequents are identified. Due to the variety of linguistic forms encoding causal relations, as indicated above, there may be more than one causal relation encoded in one sentence. Second, the terms representing each process-entity structure of the antecedent and the consequent are identified. Third, after the identification of a process, the entity which undergoes this process must fill the entity-slot of the same antecedent/consequent "process-entity" structure.
In cases where there are two entity nouns in a phrase qualifying for some antecedent/consequent, a choice is made between these entity nouns on the basis of some additional features. There is a further point to be made concerning the identification of the entity noun which is to fill an entity-slot, namely the recovery of a noun related to some process when the former is absent from the text. The linguistic items constituting the entity-process structures in complex causal relations have to be recovered from a number of complex linguistic expressions used in the text. Some of the complex linguistic phenomena which may be involved in the recovery of such items are anaphora and deixis, noun modification, coordination, and clausal subordination such as relative, temporal, conditional and concessive clauses.

IV. AN ILLUSTRATIVE EXAMPLE OF A GRAMMAR

An illustrative example of a Text Grammar is given below in a BNF-like form.

Sentence Structure
s :- np cc np.

Noun Phrase Structures
np :- det np | q np | adj np | det adj np.
np :- np pp | pnp pp.
pnp :- d prn.

Prepositional Phrase Structures
pp :- prep np | prep pp.

Process Entity Pairs
pe :- np prn agp | prs ens | prs def.

Entity Process Pairs
ep :- ens prs | np.

Process Strings
prs :- vb prs | adj proc | prn pp | proc np an np an | q np adj prn | np adj prn | np perm prs | prvn np perm | prvn det prn | prn prn | det prn | prvn(P).
prvn :- proc(X) | prn(X).

Entity Strings
ens :- pp | np | prep np | adj noun.

Causal Connectors
cc :- “causes” | “induces” | “inhibits” | “as” | copula “caused by” | “forces” | “opposes”.

Two POS-tagged corpus sentences illustrating the constructions involved:

Aspirin/NN inhibits/VBZ nuclear/JJ factor_kappa/NN B/NN mobilization/NN and/CC monocyte/NN adhesion/NN in/IN stimulated/VBN human/JJ endothelial/JJ cells/NNS ./.

The/DT antioxidant/JJ pyrrolidine/NN dithiocarbamate/NN (/( PDTC/NN )/) specifically/RB inhibits/VBZ activation/NN of/IN nuclear/JJ factor_kappa/NN B/NN (/( NF_kappa/NN B/NN )/) ./.
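A minimal sketch of how the four-argument "cause" predicate of the previous section might be filled from a POS-tagged sentence of the s :- np cc np form is given below. The connector list follows the grammar above, but the process/entity head heuristic is our own naive assumption, far cruder than the multi-level analysis the grammar supports.

```python
# Hedged sketch of filling cause(consequent_process, consequent_entity,
# antecedent_process, antecedent_entity) from a POS-tagged sentence.
CONNECTORS = {"causes", "induces", "inhibits", "forces", "opposes"}

def process_entity(span):
    """span: list of (word, tag) pairs for one noun phrase.
    Returns (process, entity). Naive assumed heuristic: in
    "activation of X" the first noun is the process; otherwise the
    last noun is taken as the process head, as in "monocyte adhesion"."""
    nouns = [w for w, tag in span if tag.startswith("NN")]
    words = [w for w, _ in span]
    if "of" in words:
        return nouns[0], nouns[-1]
    return nouns[-1], nouns[0]

def cause_predicate(tagged):
    """tagged: list of (word, tag) pairs for one s :- np cc np sentence."""
    idx = next(i for i, (w, _) in enumerate(tagged) if w in CONNECTORS)
    consequent_process, consequent_entity = process_entity(tagged[idx + 1:])
    antecedent_process, antecedent_entity = process_entity(tagged[:idx])
    return (consequent_process, consequent_entity,
            antecedent_process, antecedent_entity)
```

Real corpus sentences would additionally require the anaphora, coordination and modification handling discussed above before the slots can be filled reliably.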
V. A PILOT EXPERIMENT TOWARDS MEMORY BASED GRAMMAR INDUCTION

A pilot experiment towards Memory Based Grammar Induction using the TiMBL system [2] is described here. The corpus used is the GENIA corpus [3], used by the BioCreative European competition. Causal sentences containing causal verbs were collected as positive examples. The causal verbs used are verbs like cause, induce, inhibit, etc. Illustrative causal sentences containing such verbs are listed below, grouped by verb, in the POS-tagged form found in the GENIA corpus.

PREVENT
A/DT specific/JJ inhibitor/NN of/IN caspases/NNS ,/, zVAD_FMK/NN ,/, prevents/VBZ triptolide_induced/JJ PARP/NN degradation/NN and/CC DNA/NN fragmentation/NN but/CC not/RB growth/NN arrest/NN ./.
Mutation/NN of/IN the/DT core/NN ETS/NN binding/NN site/NN from/IN _GGAA_/NN to/TO _GGAT_/NN prevents/VBZ the/DT binding/NN of/IN ETS_like/JJ factors/NNS with/IN the/DT exception/NN of/IN ETS1/NN ./.

PRODUCE
Dexamethasone/NN produced/VBD a/DT rapid/JJ and/CC sustained/JJ increase/NN in/IN glucocorticoid/NN response/NN element/NN binding/NN and/CC a/DT concomitant/JJ 40_50/CD %/NN decrease/NN in/IN AP_1/NN ,/, NF/NN kappa/NN B/NN ,/, and/CC CREB/NN DNA/NN binding/NN that/WDT was/VBD blocked/VBN by/IN combined/JJ dexamethasone/NN and/CC cytokine/NN or/CC PMA/NN treatment/NN ./.
Targeted/VBN remodeling/NN of/IN human/JJ beta_globin/NN promoter/NN chromatin/NN structure/NN produces/VBZ increased/VBN expression/NN and/CC decreased/VBN silencing/NN ./.

CAUSE
TI/LS _/: Hypoxia/NN causes/VBZ the/DT activation/NN of/IN nuclear/JJ factor/NN kappa/NN B/NN through/IN the/DT phosphorylation/NN of/IN I/NN kappa/NN B/NN alpha/NN on/IN tyrosine/NN residues/NNS ./.
TI/LS _/: E2F_1/NN blocks/VBZ terminal/JJ differentiation/NN and/CC causes/VBZ proliferation/NN in/IN transgenic/JJ megakaryocytes/NNS ./.

REDUCE
Nitric/JJ oxide/NN selectively/RB reduces/VBZ endothelial/JJ expression/NN of/IN adhesion/NN molecules/NNS and/CC proinflammatory/JJ cytokines/NNS ./.
INDUCE
IL_4/NN induces/VBZ Cepsilon/NN germline/NN transcripts/NNS and/CC IgE/NN isotype/NN switching/NN in/IN B/NN cells/NNS from/IN patients/NNS with/IN gammac/NN chain/NN deficiency/NN ./.
Binding/NN of/IN activated/VBN platelets/NNS induces/VBZ IL_1/NN beta/NN ,/, IL_8/NN ,/, and/CC MCP_1/NN in/IN leukocytes/NNS ./.

INHIBIT
The/DT increase/NN of/IN cortisol/NN is/VBZ immunosuppressive/JJ and/CC reduces/VBZ GR/NNS concentration/NN both/CC in/IN nervous/JJ and/CC immune/JJ systems/NNS ./.

PROMOTE
Ligand_induced/JJ tyrosine/NN phosphorylation/NN of/IN the/DT STATs/NNS promotes/VBZ their/PRP$ homodimer/NN and/CC heterodimer/NN formation/NN and/CC subsequent/JJ nuclear/JJ translocation/NN ./.
Tax/NN expression/NN promotes/VBZ N_terminal/JJ phosphorylation/NN and/CC degradation/NN of/IN IkappaB/NN alpha/NN ,/, a/DT principal/JJ cytoplasmic/JJ inhibitor/NN of/IN NF_kappaB/NN ./.

STIMULATE
IL_5/NN also/RB stimulates/VBZ GTP/NN binding/VBG to/TO p21ras/NN ./.
Overexpression/NN of/IN protein/NN kinase/NN C_zeta/NN stimulates/VBZ leukemic/JJ cell/NN differentiation/NN ./.
AB/LS _/: Interleukin_7/NN (/( IL_7/NN )/) stimulates/VBZ the/DT proliferation/NN of/IN normal/JJ and/CC leukemic/JJ B/NN and/CC T/NN cell/NN precursors/NNS and/CC T/NN lymphocytes/NNS ./.

One hundred sentences like the above were used as the set of positive examples and one hundred non-causal sentences were used as the set of negative examples. The influence of the number of words of each sentence taken into account for learning on the confusion matrix was studied. Some indicative results are shown below.
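The memory-based setup can be sketched as a nearest-neighbour classifier with a simple overlap metric over the first k tokens of each POS-tagged sentence, analogous to (but much cruder than) TiMBL's IB1 algorithm. The miniature training set below is an invented stand-in for the 100 positive and 100 negative GENIA sentences of the experiment.

```python
# Sketch of memory-based classification: 1-nearest-neighbour with the
# overlap metric over the first k word/tag tokens. TRAIN is an invented
# stand-in for the 100 positive and 100 negative corpus sentences.
TRAIN = [
    ("Hypoxia/NN causes/VBZ activation/NN of/IN NF_kappa_B/NN".split(), "Y"),
    ("IL_4/NN induces/VBZ IgE/NN isotype/NN switching/NN".split(), "Y"),
    ("The/DT cells/NNS were/VBD cultured/VBN overnight/RB".split(), "N"),
]

def features(tokens, k):
    """Truncate to the first k word/tag tokens, padding short sentences."""
    toks = tokens[:k]
    return toks + ["_"] * (k - len(toks))

def overlap(a, b):
    """Number of positions at which two feature vectors agree."""
    return sum(x == y for x, y in zip(a, b))

def classify(train, sentence, k=7):
    """Return the class of the nearest training example (first wins ties)."""
    vec = features(sentence, k)
    best = max(train, key=lambda ex: overlap(features(ex[0], k), vec))
    return best[1]
```

Varying k in this sketch corresponds to the "number of words taken into account" that is varied in the experiment below.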
5 words: 150/200 (0.750000), of which 25 exact matches.
There were 3 ties of which 3 (100.00%) were correctly resolved.
Confusion Matrix:
      Y    N
Y |  84   16
N |  34   66

6 words: 146/200 (0.730000), of which 52 exact matches.
There were 17 ties of which 5 (29.41%) were correctly resolved.
Confusion Matrix:
      Y    N
Y |  78   22
N |  32   68

7 words: 158/200 (0.790000), of which 41 exact matches.
There were 13 ties of which 7 (53.85%) were correctly resolved.
Confusion Matrix:
      Y    N
Y |  86   14
N |  28   72

8 words: 149/200 (0.745000), of which 27 exact matches.
There were 9 ties of which 5 (55.56%) were correctly resolved.
Confusion Matrix:
      Y    N
Y |  83   17
N |  34   66

9 words: 150/200 (0.750000), of which 27 exact matches.
There were 5 ties of which 2 (40.00%) were correctly resolved.
Confusion Matrix:
      Y    N
Y |  84   16
N |  34   66

10 words: 154/200 (0.770000), of which 78 exact matches.
There were 13 ties of which 5 (38.46%) were correctly resolved.
Confusion Matrix:
      Y    N
Y |  82   18
N |  28   72

A graphical representation of some results is given in Figs. 1 and 2.

[Figure 1: Causal Sentences; accuracy (%) from 5 to 27 tags/line]
[Figure 2: Non-Causal Sentences; accuracy (%) from 5 to 27 tags/line]

The general trend of the results was that the best performance was obtained when about 7 words were taken into account. This was not very promising, since the corpus sentences had a maximum length of about 30 words. The reasons for this phenomenon are being investigated. Another important observation was that among the 100 positive examples very few patterns repeated themselves. This means that the automatic induction of multilevel grammars [1] would be rather problematic.

VI. CONCLUSION

A top-down approach to the determination of grammar engineering methods for the construction of Natural Language Grammars for Question Answering Systems based on Intelligent Text Mining for Biomedical applications was presented in this paper. Our proposal is to start from the formulation of the ultimate task goal, such as specialized question answering, and to derive from it the specifications of the grammar engineering methods, such as hand crafting or machine learning. An illustrative example of a Text Grammar was given and a pilot experiment towards Memory Based Induction using TiMBL was described. The results obtained so far render the automatic induction of multilevel grammars problematic.

References
[1] Girju, R. (2003). Automatic Detection of Causal Relations for Question Answering. Proceedings of the ACL-2003 Workshop on Multilingual Summarization and Question Answering.
[2] TiMBL: http://ilk.uvt.nl/software.html
[3] GENIA corpus: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
[4] Kontos, J. (1992). ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology, Vol. 34, No. 9, pp. 611-616.
[5] Kontos, J., Elmaoglou, A., Malagardi, I. (2002). ARISTA Causal Knowledge Discovery from Texts. Proceedings of the 5th International Conference DS 2002, Luebeck, Germany, November 2002. Springer Verlag, pp. 348-355.
[6] Kontos, J., Malagardi, I., Peros, J. (2003a). The AROMA System for Intelligent Text Mining. HERMIS International Journal of Computer Mathematics and its Applications, Vol. 4, pp. 163-173. LEA.
[7] Kontos, J., Malagardi, I., Peros, J. (2003b). The Simulation Subsystem of the AROMA System. HERCMA 2003, Proceedings of the 6th Hellenic European Research on Computer Mathematics & its Applications Conference, Athens.