
QUESTION ANSWERING AND RHETORIC ANALYSIS OF BIOMEDICAL TEXTS IN THE AROMA SYSTEM
 JOHN KONTOS, IOANNA MALAGARDI and JOHN PEROS
Abstract--Question answering with intelligent knowledge
management of biomedical texts and analysis of rhetoric
relations in the AROMA system is presented. The
development of the AROMA system aims at the creation of an
intelligent tool for the support of the discovery and adaptation
of biomedical models based on data extracted from natural
language texts. The system operation includes three main
functions, namely question answering, text mining and
simulation. The question answering function generates
model-based answers and their explanations. The operation of
AROMA allows the exploitation of rhetoric relations between
a “basic” text that proposes a model of a biomedical system
and parts of the abstracts of papers that present experimental
findings supporting the model. An important use of AROMA
concerns the comparison of experimental data with the model
proposed in the basic text. The AROMA system consists of
three subsystems. The first subsystem extracts knowledge
including rhetoric relations from biomedical texts. The second
subsystem answers questions with causal knowledge extracted
by the first subsystem and generates explanations using
rhetoric relation knowledge in addition to other knowledge.
The third subsystem simulates the time-dependent behavior of
a model from which textual descriptions of the waveforms are
generated automatically.
Index Terms-- rhetoric relations, intelligent biomedical text
mining, knowledge discovery from text, simulation, question
answering, explanation, p53, mdm2.
I. INTRODUCTION
This paper presents the new AROMA (Automatic Rhetoric
Organizer for Model Analysis) system and describes an
illustrative example of its application. This
system is an intelligent computer tool for question
answering and text mining [5] of biomedical knowledge
including the management of rhetoric knowledge. Parts of
the older version of the system were presented in [6], [7],
[8], [9], [10].
 ARTIFICIAL INTELLIGENCE GROUP
DEPARTMENT OF PHILOSOPHY AND HISTORY OF SCIENCE
NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS
E-mail: ikontos2003@yahoo.com,imal@gsrt.gr
The development of the AROMA system aims at the
creation of an intelligent tool for the support of the
discovery and adaptation of biomedical models based on
data extracted from natural language texts. The system
operation includes two main functions, namely question
answering and text mining. The question answering
function generates model-based explanations of the
answers. The expanded operation of AROMA allows the
exploitation of rhetoric relations between a basic text that
proposes a model of a biomedical system and parts of the
text of papers that present experimental data supporting the
model.
In a typical application of the system a theoretical text
presenting the model of a system under scientific
investigation is taken as the “basic” one and a computerized
method is used for the analysis of the relationship of the
model to the texts of the papers that provide experimental
support of the model or theory proposed by it. Text mining
is applied both to the basic paper and the supporting papers
in order to extract the knowledge fragments that have to be
rhetorically related and computationally processed. An
important aspect of the application of the system is the
answering of natural language questions for the computer
aided comparison of new experimental data with the
predictions of the model proposed in the “basic” paper. The
friendly man-machine interface of the system is
capable of answering such questions and generating
explanations that help the user in tracing the support of the
model by experimental and background domain knowledge.
It is envisaged that the AROMA System may prove useful
for the support of the discovery activity of research
scientists with the intelligent management of scientific
knowledge mined from texts. Text mining differs from data
mining [2] in that it uses unstructured texts for the
collection of knowledge rather than structured sources like
databases. We define intelligent text mining as the process
of creating novel knowledge from texts that is not stated in
them by combining sentences with deductive reasoning. An
early implementation of this kind of intelligent text mining
was reported by us in [4]. A review of different kinds of
text mining including intelligent text mining is presented in
[11].
There are two possible methodologies of applying
deductive reasoning to texts. The first methodology is based
on the translation of the texts into some formal
representation of their “content” with which deduction is
performed [12]. The advantage of this methodology
lies in the simplicity and availability of the required
inference engine but its serious disadvantage is the need for
reprocessing all the texts and storing their translation into a
formal representation every time something changes in their
domain. In the case of scientific texts what may change is
some part of the background knowledge such as the
ontology used for deducing new knowledge. The second
methodology eliminates the need for translation of the texts
into a formal representation because an inference engine
capable of performing deductions “on the fly”, i.e. directly
from the original texts, is implemented.
A disadvantage of this second methodology is that a more
complex inference engine than the one needed by the first
one must be built. The implementation of such inference
engines has however proved feasible for causal knowledge.
The strong advantage of the second methodology is that the
translation into a formal representation is avoided. In [4] we
proposed the second methodology and the method we
developed was therefore called ARISTA i.e. Automatic
Representation Independent Syllogistic Text Analysis. The
ARISTA method is used in the AROMA system for its
basic reasoning functions.
An early attempt was also made by us in [4] to implement a
model-based question answering system using the scientific
text as a knowledge base describing a qualitative model.
One of our early application examples concerned medical
text mining related to human respiratory system mechanics
[4]. Biomedical text mining is now recognized as a very
important field of study [3], [17], particularly for
molecular biology. The system GeneWays [16] is a typical
example of a biomedical text analysis system for the
extraction of molecular pathway data. The idea of
combining text mining with simulation and question
answering was pursued further by our group as reported in
[7], [8], [9] and [10] as well as in the present paper.
The general architecture of our AROMA system consists of
three subsystems, namely the Knowledge Extraction
Subsystem, the Causal Reasoning Subsystem and the
Simulation Subsystem. These subsystems are briefly
described below.
II. THE GENERAL ARCHITECTURE OF THE AROMA SYSTEM
A. The Knowledge Extraction Subsystem
This subsystem integrates partial causal knowledge
extracted from a number of different texts. This knowledge
is expressed in natural language using causal verbs such as
“regulate”, “enhance” and “inhibit”. These verbs usually
take as arguments entities such as protein names and gene
names that occur in the biomedical texts that we use for the
present applications. In this way causal relations between
entities and processes are expressed. A lexicon containing
words such as causal verbs and stop words is used by this
subsystem. An output file is produced by the system that
contains parts of sentences collected from the original
sentences of different input texts. This output file is used
for reasoning by the second subsystem. The input files
used for this subsystem in the example of p53-mdm2
dynamics contain texts downloaded from MEDLINE. The
operation of the subsystem is based on the recognition of
noun phrases and verb groups and their relations.
B. The Causal Reasoning Subsystem
The output of the first subsystem is used as input to the
second subsystem that combines background knowledge
with causal knowledge in natural language form to produce
by automatic deduction conclusions not mentioned
explicitly in the input text. The operation of this subsystem
is based on the ARISTA method [4] and results in the
on-the-fly recognition of causal relations of the form
“causes(process1, entity1, process2, entity2, manner)”.
The pair (process1, entity1) stands for the cause, the pair
(process2, entity2) stands for the effect and “manner”
stands for the kind of causality, i.e. whether it is positive or
negative. The sentence fragments containing causal
knowledge are analyzed and the entity-process pairs are
recognized. The user questions are analyzed and reasoning
goals are extracted from them. The answers to the user
questions are generated automatically by a reasoning
process together with explanations in natural language
form. This is accomplished by the chaining “on the fly” of
causal statements using background knowledge such as an
ontology to support the reasoning process.
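The five-argument causal relation and the chaining of causal statements described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the ARISTA implementation; the class name Causes, the helper names compose_manner and chain, and the process names in the usage example are assumptions introduced here. The rule that two equal manners compose to “+” and two different manners compose to “-” is likewise an assumption about how the sign of a chain is obtained.

```python
from typing import NamedTuple, Optional


class Causes(NamedTuple):
    """One causal relation: (process1, entity1) causes (process2, entity2)."""
    process1: str
    entity1: str
    process2: str
    entity2: str
    manner: str  # "+" for positive causality, "-" for negative causality


def compose_manner(first: str, second: str) -> str:
    """Assumed sign rule: equal manners give "+", different manners give "-"."""
    return "+" if first == second else "-"


def chain(a: Causes, b: Causes) -> Optional[Causes]:
    """Chain two relations when the effect of `a` is the cause of `b`."""
    if (a.process2, a.entity2) != (b.process1, b.entity1):
        return None  # the two statements do not connect
    return Causes(a.process1, a.entity1, b.process2, b.entity2,
                  compose_manner(a.manner, b.manner))


# Hypothetical relations; the process names are illustrative only.
first = Causes("activity", "p53", "expression", "mdm2", "+")
second = Causes("expression", "mdm2", "activity", "p53", "-")
derived = chain(first, second)
```

In this sketch the derived relation has the same entity as cause and effect with manner “-”, which corresponds to a negative feedback loop.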
C. The Simulation Subsystem
The third subsystem is used for modelling the dynamics of
the biomedical system specified in the “basic” text. The
characteristics of the model such as structure and parameter
values are extracted from the input texts combined with
background knowledge such as ontology and default
process and entity knowledge [9]. Considering the p53-mdm2
example, two coupled first-order differential
equations were used as the approximate mathematical
model of the biomedical system in rough correspondence
with the model proposed in [1]. A basic characteristic of the
behaviour of such a system is the occurrence of oscillations
for certain values of the parameters of the equations. Two
finite difference equations that approximate the differential
equations system of the model are:
Δx = a1*x + b1*y + c1*x*y    (1)
Δy = a2*y + b2*delay(d,x)    (2)
where Δx means the difference between the value of the
variable x at the present time and the value of the variable x
at the next time instant and delay(d,x) in equation (2) stands
for the value that x had d units of time before the present time.
Time is taken to advance in discrete steps. The symbols x
and y are the variables that represent the concentrations of
the proteins p53 and mdm2 respectively. The symbols a1,
b1, c1, a2, b2 are the coefficients of the equations that
represent the parameters of the biomedical system. It is
noted that the multiplicative term c1*x*y renders equation (1)
non-linear. This non-linearity causes the appearance of the
oscillations to differ from simple sine waves. The solution
of these equations is accomplished with a Prolog program
that provides an interface for manipulating the parameters
of the model. An important module of the simulation
subsystem is one that generates text describing the
behaviour of the variables of the model over time. The above
information concerning the connection of the parts of the
differential equations with the parts of the biomedical
system being modelled is formalized using rhetorical
relations explained below.
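A minimal sketch of how equations (1) and (2) can be iterated in discrete time steps is given here. It is written in Python for illustration only and is not the Prolog program of the simulation subsystem. The function name simulate, the convention that the next value equals the present value plus Δ, and the use of the initial value x0 whenever fewer than d steps have elapsed are assumptions made for this sketch; no coefficient values from [1] are implied.

```python
def simulate(x0, y0, a1, b1, c1, a2, b2, d, steps):
    """Iterate the two finite difference equations for `steps` time steps.

    Returns the lists of x and y values, including the initial values.
    """
    xs, ys = [x0], [y0]
    for t in range(steps):
        x, y = xs[t], ys[t]
        # delay(d, x): the value x had d steps ago (assumed x0 before that)
        x_delayed = xs[t - d] if t >= d else x0
        dx = a1 * x + b1 * y + c1 * x * y   # equation (1)
        dy = a2 * y + b2 * x_delayed        # equation (2)
        xs.append(x + dx)                   # assumed: next = present + delta
        ys.append(y + dy)
    return xs, ys


# Hypothetical coefficients chosen only to make the steps easy to follow.
xs, ys = simulate(x0=1, y0=0, a1=0, b1=-1, c1=0, a2=0, b2=1, d=0, steps=2)
```

With these illustrative values y grows while x is positive and x then falls because of the negative coefficient b1, which is the qualitative mechanism behind the oscillations mentioned above.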
III. THE RHETORIC RELATIONS
Research on the rhetoric or discourse analysis of biomedical
texts has started only recently [15]. In [15] an annotation
scheme is proposed for a rhetorical analysis of biology
articles using the “zoning” method. This method
characterizes parts of a scientific text using an annotation
scheme to identify these parts or “zones”.
In our system we apply the rhetoric relations approach [13,
14] in contrast with the zoning approach. This approach
connects parts of sentences and other textual fragments
using rhetoric relations.
About 50 rhetoric relations have been theoretically defined
in [13] but only 4 were computationally defined in [14]
namely “contrast”, “cause-explanation-evidence”,
“elaboration” and “condition” that were defined at a much
coarser level of granularity for practical reasons. It is
however proposed to enrich this approach of analysis by
using a few more rhetoric relations necessary for the
representation of the content of scientific texts related to
models of systems.
We distinguish between two kinds of purely textual rhetoric
relations namely internal to the “basic” paper (symbolized
as “inr”) and external to it (symbolized as “exr”). A third
kind of relations (symbolized as mbr) is proposed that
formalizes the appearance of the waveforms of the time
behaviour of the model variables. These waveforms are
treated as “numerical narratives” with a “rhetoric” structure
of events equivalent to the “pattern structure” of the
waveforms where peaks and valleys or maxima and minima
play the role of events. All these kinds of relations are
briefly described below and illustrated using examples from
[1] where “coerel” and “varrel” are of the inr kind,
“parbib” and “entfra” are of the exr kind and “behrel” and
“tfollows” are of the mbr kind and are defined below.
The formal representation of the rhetoric relations used by
our system consists of a single predicate
rr(PAR_1,…,PAR_n) where the arguments PAR_1 to
PAR_n stand for n parameters that define a rhetoric relation
between two rhetorically related “objects”. These objects
may be of different kinds such as sentences or other text
fragments, equations, equation variables, physical
quantities, citations, references or parts of waveforms
representing the time behavior of model variables. The
formal representation:
rr(relation_name,relation_kind,
first_object_kind,first_object_identifier,
second_object_kind,second_object_identifier).
is used below for the illustration of some rhetorical
relations used by our system.
a. Internal Relations (inr)
1) The relation name “coerel” stands for relations between a
coefficient “c” of a mathematical model and the
corresponding entity property or parameter “ep” of the
biomedical system modeled.
Or formally: rr(coerel,inr,coefficient,c,parameter,ep).
2) The relation name “varrel” stands for relations between a
variable “x” of an equation of the model and a physical
entity “e” or an entity property “ep” of the biomedical
system modeled.
Or formally: rr(varrel,inr,variable,x,entity,e) and
rr(varrel,inr,variable,x,entityp,ep).
b. External Relations (exr)
3) The relation name “parbib” stands for relations between
a parameter “p” of the biomedical system and a
bibliographic reference “r” to a paper that contains
experimental data that support the inclusion and possibly
the numerical value of this parameter.
Or formally: rr(parbib,exr,parameter,p,reference,r).
4) The relation name “entfra” stands for relations between
an entity “e” of a biomedical system and a text fragment
labeled “f” of a reference text labeled “r” and is symbolized
by “r_f” that presents the experimental data that support the
inclusion and possibly the numerical values of the
properties of the entity “e”.
Or formally: rr(entfra,exr,entity,e,fragment,r_f).
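The six-argument rr predicate and the lookup of external support for an entity can be sketched as follows. The sketch is in Python for illustration only, whereas the system itself stores such relations as Prolog facts. The class name RR, the table RELATION_KINDS and the helpers make_rr and external_support are assumptions introduced here; the two example facts use fragment labels of the p53-mdm2 example.

```python
from typing import List, NamedTuple


class RR(NamedTuple):
    """One rhetoric relation, mirroring the six arguments of rr(...)."""
    relation_name: str
    relation_kind: str
    first_object_kind: str
    first_object_identifier: str
    second_object_kind: str
    second_object_identifier: str


# Kind of each relation name as stated in the text: inr, exr or mbr.
RELATION_KINDS = {
    "coerel": "inr", "varrel": "inr",
    "parbib": "exr", "entfra": "exr",
    "behrel": "mbr", "tfollows": "mbr",
}


def make_rr(name, first_kind, first_id, second_kind, second_id) -> RR:
    """Build a relation, filling in its kind from the relation name."""
    return RR(name, RELATION_KINDS[name],
              first_kind, first_id, second_kind, second_id)


def external_support(relations: List[RR], object_id: str) -> List[str]:
    """Identifiers of references or fragments externally related to an object."""
    return [r.second_object_identifier for r in relations
            if r.relation_kind == "exr"
            and r.first_object_identifier == object_id]


facts = [
    make_rr("entfra", "entity", "p53", "fragment", "92_7"),
    make_rr("entfra", "entity", "p53", "fragment", "32_5"),
    make_rr("coerel", "coefficient", "c", "parameter", "ep"),
]
support = external_support(facts, "p53")
```

Keeping the relation kind as an explicit field, as in the rr predicate, allows internal, external and model behavior relations to be filtered with the same query.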
c. Model Behavior Relations (mbr)
5) The relation name “behrel” stands for relations between
some coefficient “c” of the model and the time behavior of a
variable “v” of the model.
Or formally: rr(behrel,mbr,coefficient,c,variable,v).
6) The relation name “tfollows” stands for time ordering
relations between a part in position “wp1” of a waveform
representing the time variation of a variable v1 of the model
such as a peak or a valley symbolized as wp1_v1 and a part
in position “wp2” of a waveform representing the time
variation of variable v2 and symbolized as wp2_v2. The
kind of object “waveform part” is symbolized as “wpart”.
Or formally:
rr(tfollows,mbr,wpart,wp1_v1,wpart,wp2_v2).
IV. SOME RHETORIC RELATIONS IN THE P53-MDM2 EXAMPLE
In the p53-mdm2 example the rhetoric relations extracted
concern the text [1] that proposes a mathematical model for
the interaction of the proteins p53 and mdm2 as well as the
MEDLINE abstracts of papers related to the model
proposed by [1] either used or not as references by [1].
These abstracts were downloaded from MEDLINE and
contain knowledge of experimental results concerning the
interaction of the proteins p53 and mdm2. These proteins
are involved in the life cycle of the cell and interact through
a negative feedback system. Some rhetoric relations found
in the text [1] are:
r1:rr(coerel,inr,coefficient,sourcep53,
parameter,synthesis_rate_of_the_p53_protein)
extracted from the sentence: “Here the coefficient sourcep53
specifies the synthesis rate of the p53 protein.”
r2:rr(coerel,inr,coefficient,activity,parameter,
p53's_sequence-specific_DNA_binding_activity)
extracted from the sentence: “The coefficient activity can
include p53's sequence-specific DNA binding activity”
r3:rr(varrel,inr,variable,degradation(t),entity,
rate_of_degradation)
extracted from the sentence: “The variable degradation(t)
measures the rate of degradation”
r4:rr(coerel,inr,coefficient,p1,entity,
rate_of_p53-independent_mdm2_transcription)
and
r5:rr(parbib,exr,entity,mdm2,reference,26)
extracted from the sentence: “Here the coefficient p1
denotes the rate of p53-independent mdm2 transcription and
translation (24)”
The abstracts of two papers presenting experimental
data supporting the qualitative model proposed by [1]
named with the labels “32” and “92” can be used to
illustrate the question answering process of the AROMA
system and the use of external rhetoric relations in the
explanations generated by the question answering process.
The first abstract labeled as “32” as listed in [1]
consists of six sentences from which two are selected by the
first subsystem of AROMA and from which the following
two sentence fragments are extracted automatically:
1.“The p53 protein regulates the mdm2 gene”
2.“regulates both the activity of the p53 protein”
These fragments are then automatically transformed to
Prolog facts in order to be processed by the second
subsystem as shown below:
t(“32_5”, “The p53 protein regulates the mdm2 gene”).
t(“32_6”, “regulates both the activity of the p53 protein”).
The labels 32_5 and 32_6 denote that these fragments are
extracted from the sentences 5 and 6 of the text with label
32.
The second abstract labeled by the number “92” that was
found independently of [1] consists of seven sentences from
which two are selected by the first subsystem of AROMA
from which the following two sentence fragments are
extracted automatically:
3.“The mdm2 gene enhances the tumorigenic potential of
cells”
4.“The mdm2 oncogene can inhibit p53_mediated
transactivation”
and expressed in the form of Prolog facts as:
t(“92_3”, “The mdm2 gene enhances the tumorigenic
potential of cells”).
t(“92_7”, “The mdm2 oncogene can inhibit p53_mediated
transactivation”).
The labels 92_3 and 92_7 denote that these fragments are
extracted from the sentences 3 and 7 of the text with label
92.
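The labeling convention used above, in which a label combines the text label with the sentence number, can be sketched as follows. The sketch is in Python for illustration only; the function names to_prolog_fact and split_label are assumptions, and plain double quotes are used for the generated Prolog strings.

```python
def to_prolog_fact(text_label: str, sentence_number: int, fragment: str) -> str:
    """Render a sentence fragment as a t(Label, Fragment) fact."""
    label = f"{text_label}_{sentence_number}"
    # Escape backslashes and quotes so the fragment stays one quoted string.
    escaped = fragment.replace("\\", "\\\\").replace('"', '\\"')
    return f't("{label}", "{escaped}").'


def split_label(label: str):
    """Recover (text label, sentence number) from a label such as 92_7."""
    text_label, _, number = label.rpartition("_")
    return text_label, int(number)


fact = to_prolog_fact(
    "92", 7, "The mdm2 oncogene can inhibit p53_mediated transactivation")
```

Splitting on the last underscore keeps the sketch usable even if a text label itself contains an underscore.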
Using the above sentence fragments of the p53-mdm2
example our system discovers the causal negative feedback
loop by appropriate chaining of the relevant sentence
fragments and represents it as:
p53 +causes mdm2 -causes p53
where
+causes means “causes increase” and
-causes means “causes decrease or inhibition”.
This is effected by answering the question:
Is there a process loop of p53?
This question is internally represented as the Prolog goal:
“cause(P1,p53,P2,p53,S)”
where P1 and P2 are two process names that the system
extracts from the texts and characterize the behavior of the
entity p53. S stands for the overall effect of the feedback
loop found i.e. whether it is a positive or a negative
feedback loop. In this case S is found equal to “-” since a
positive causal connection is followed by a negative one.
The short answer automatically generated by our system is:
Yes.
The loop is p53 activity -causes p53 production.
By a short answer we mean a simple answer not connected
to any explanation of the reasoning followed for the
derivation of the answer.
The long answer automatically generated by our system
together with an appropriate explanation is as follows:
The QUESTION is:
“Is there a process loop of p53 ? ”
Represented internally in Prolog as:
cause(P1,p53,P2,p53,S).
USING INFERENCE RULE IR4a
since the DEFAULT entity of <p53_mediated> is <p53>
with rhetoric relations:
rr(entfra,exr,entity,p53,fragment,92_7)
rr(entfra,exr,entity,mdm2,fragment,92_7)
USING INFERENCE RULE IR4b
with rhetoric relations:
rr(entfra,exr,entity,p53,fragment,32_5)
rr(entfra,exr,entity,mdm2,fragment,32_5)
the EXPLANATION is:
since <oncogene> is a kind of <gene>
p53 protein -causes p53
because
p53 protein +causes gene of mdm2
and
oncogene of mdm2 -causes p53 mediated transactivation of
p53.
It should be noted that the combination of sentence
fragments (92_7) and (32_5) in a causal chain that forms a
closed negative feedback loop is based on two facts of
background knowledge.
This background knowledge is inserted manually in our
system as Prolog facts and can be stated as:
the DEFAULT entity of <p53_mediated> is <p53> or
default(p53_mediated, p53).
<oncogene> is a kind of <gene> or kind_of(oncogene,
gene).
The above analysis of the text fragments of the example is
partially based on the following background linguistic and
domain knowledge which is manually inserted as Prolog
facts:
Linguistic Knowledge:
kind_of(“the”,“determiner”).
kind_of(“is”,“copula”).
kind_of(“of”,“preposition”).
kind_of(“activated”,“causal_connector”).
kind_of(“inhibits”,“causal_connector”).
kind_of(“regulated”,“causal_connector”).
Domain Knowledge:
kind_of(“protein”,“entity”).
kind_of(“DNA”,“entity”).
kind_of(“p53”,“entity”).
kind_of(“Mdm2”,“entity”).
kind_of(“damage”,“process”).
kind_of(“expression”,“process”).
kind_of(“increase”,“process”).
kind_of(“activity”,“process”).
The above knowledge base fragment contains both
linguistic and domain knowledge to support the analysis of
the sentences occurring in the corpus. In practice these two
parts of knowledge are handled differently by the inference
rules of the reasoning module.
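The discovery of the negative feedback loop from the sentence fragments of this section can be illustrated with a simplified sketch. It is written in Python for illustration and does not reproduce the inference rules IR4a and IR4b. The verb-to-sign table, the word-level parsing, the restriction to loops of two steps and all function names are assumptions made here; the two background facts default(p53_mediated, p53) and kind_of(oncogene, gene) are the ones stated above.

```python
from typing import List, NamedTuple, Optional, Tuple

# Assumed lexicon: sign attached to each causal verb form (illustrative only).
CAUSAL_VERBS = {"regulates": "+", "enhances": "+", "inhibit": "-"}
ENTITIES = {"p53", "mdm2"}
KIND_OF = {"oncogene": "gene"}            # kind_of(oncogene, gene)
DEFAULT_ENTITY = {"p53_mediated": "p53"}  # default(p53_mediated, p53)


class Relation(NamedTuple):
    label: str                # fragment label, e.g. "32_5"
    cause: Tuple[str, str]    # (role, entity), e.g. ("protein", "p53")
    effect: Tuple[str, str]
    sign: str                 # "+" or "-"


def _side(words: List[str]) -> Optional[Tuple[str, str]]:
    """Find the first known entity and the word that follows it (its role)."""
    for i, word in enumerate(words):
        entity = DEFAULT_ENTITY.get(word, word)
        if entity in ENTITIES:
            role = words[i + 1] if i + 1 < len(words) else ""
            return KIND_OF.get(role, role), entity
    return None


def parse(label: str, fragment: str) -> Optional[Relation]:
    """Split a fragment at its causal verb into a signed cause-effect pair."""
    words = fragment.split()
    for i, word in enumerate(words):
        if word in CAUSAL_VERBS:
            cause, effect = _side(words[:i]), _side(words[i + 1:])
            if cause and effect:
                return Relation(label, cause, effect, CAUSAL_VERBS[word])
    return None


def find_loop(entity: str, relations: List[Relation]):
    """Search for a two-step causal loop that starts and ends at `entity`."""
    for first in relations:
        if first.cause[1] != entity:
            continue
        for second in relations:
            if second.cause == first.effect and second.effect[1] == entity:
                sign = "+" if first.sign == second.sign else "-"
                return [first.label, second.label], sign
    return None


fragments = {
    "32_5": "The p53 protein regulates the mdm2 gene",
    "32_6": "regulates both the activity of the p53 protein",
    "92_3": "The mdm2 gene enhances the tumorigenic potential of cells",
    "92_7": "The mdm2 oncogene can inhibit p53_mediated transactivation",
}
parsed = [parse(label, text) for label, text in fragments.items()]
relations = [r for r in parsed if r is not None]
loop = find_loop("p53", relations)
```

In this sketch only fragments 32_5 and 92_7 yield complete cause-effect pairs, and they connect because the role “oncogene” is normalized to “gene” through the kind_of fact while “p53_mediated” is resolved to p53 through the default fact.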
V. AUTOMATIC DESCRIPTION OF THE BEHAVIOR OF THE
MODEL VARIABLES
The system can automatically generate descriptions of the
time behaviour of the concentrations of the two proteins
that are represented by the two variables of the model using
the “tfollows” rhetoric relation. More details may be found
in [9] and [10]. An example is shown below of an
automatically produced description of the numerical results
produced by the solution of the model equations, where T
stands for time and the protein names are capitalized as P53
and MDM2 for emphasis only:
“A peak of P53 at T=3.6 P53=28830 is followed by a peak
of MDM2 at T=6.4 MDM2=16550 which is followed by a
valley of P53 at T=8.6 P53=-6100 which is followed by a
valley of MDM2 at T=14.6 MDM2=9360 which is
followed by a peak of P53 at T=17 P53=670”.
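A description of this form can be produced from two sampled waveforms along the following lines. The sketch is in Python for illustration and is not the generator used in AROMA, which is described in [9] and [10]. The function names, the detection of peaks and valleys as strict local maxima and minima, and the sample values in the usage example are assumptions.

```python
def extrema(times, values, name):
    """Return (time, kind, name, value) for each strict local peak or valley."""
    events = []
    for i in range(1, len(values) - 1):
        if values[i - 1] < values[i] > values[i + 1]:
            events.append((times[i], "peak", name, values[i]))
        elif values[i - 1] > values[i] < values[i + 1]:
            events.append((times[i], "valley", name, values[i]))
    return events


def describe(times, series):
    """Order the events of all variables in time and chain them in one sentence."""
    events = []
    for name, values in series.items():
        events.extend(extrema(times, values, name))
    events.sort(key=lambda event: event[0])  # time ordering, as in tfollows
    parts = [f"a {kind} of {name} at T={t} {name}={value}"
             for t, kind, name, value in events]
    if not parts:
        return ""
    text = parts[0]
    if len(parts) > 1:
        text += " is followed by " + " which is followed by ".join(parts[1:])
    return text[0].upper() + text[1:]


# Hypothetical sample values, not results of the model in [1].
times = [0, 1, 2, 3, 4]
series = {"P53": [0, 5, 1, 0, 2], "MDM2": [0, 1, 4, 3, 2]}
description = describe(times, series)
```

Each consecutive pair of events in the sorted list corresponds to one “tfollows” relation between two waveform parts.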
VI. CONCLUSIONS
In the present paper we presented the question answering
function of our AROMA system with intelligent knowledge
management and rhetoric analysis of biomedical texts
related to the modeling of a biomedical system. The
AROMA system that we have developed consists of three
main subsystems. The first subsystem achieves the
extraction of knowledge from texts that is related to the
structure and the parameters of the biomedical system
simulated. The second subsystem is based on a reasoning
process that answers questions by combining causal
knowledge extracted by the first subsystem with
background knowledge and generates explanations of the
reasoning followed. The third subsystem is a system
simulator written in Prolog that generates the time behavior
of the model’s variables. An important feature of the system
presented here is its ability for model based non-factoid
question answering with the use of rhetoric relation
recognition and intelligent causal knowledge extraction
from scientific texts with explanation generation and
automatic generation of textual descriptions of the dynamic
behavior of the model of a biomedical system.
References
[1] Bar-Or, R. L. et al (2000). Generation of oscillations by the p53-Mdm2 feedback loop: A theoretical and experimental study. PNAS, vol. 97, No 21, pp. 11250-11255.
[2] Barrera J. et al (2004). An environment for knowledge discovery in biology. Computers in Biology and Medicine, vol 34, pp 427-447.
[3] Hirschman L. et al (2002). Accomplishments and challenges in literature data mining for biology. Bioinformatics, vol. 18, 12, pp 1553-1561.
[4] Kontos, J. (1992). ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology, vol. 34, No 9, pp 611-616.
[5] Kontos, J. and Malagardi, I. (1999). Information Extraction and Knowledge Acquisition from Texts using Bilingual Question-Answering. Journal of Intelligent and Robotic Systems, vol 26, No. 2, pp. 103-122.
[6] Kontos, J. and Malagardi, I. (2001). A Search Algorithm for Knowledge Acquisition from Texts. Proceedings of HERCMA 2001, 5th Hellenic European Research on Computer Mathematics & its Applications Conference. Athens.
[7] Kontos, J. et al (2002a). ARISTA Causal Knowledge Discovery from Texts. Discovery Science 2002, Luebeck, Germany. Proceedings of the 5th International Conference DS 2002, Springer Verlag, pp 348-355.
[8] Kontos, J. et al (2002b). System Modeling by Computer using Biomedical Texts. Res Systemica, volume 2, Special Issue, October 2002. (http:/www.afscet.asso.fr/resSystemica/accueil.html).
[9] Kontos J. et al (2003a). The Simulation Subsystem of the AROMA System. HERCMA 2003. Proceedings of 6th Hellenic European Research on Computer Mathematics & its Applications Conference, Athens.
[10] Kontos J. et al (2003b). The AROMA System for Intelligent Text Mining. HERMIS, vol 4, pp 163-173.
[11] Kroetze J. A. et al (2003). Differentiating Data- and Text-Mining Terminology. Proceedings of SAICSIT 2003, pp 93-101.
[12] Kuehne S. E. and Forbus K. D. (2004). Capturing QP-relevant Information from Natural Language Text. Proceedings of the 18th International Workshop on Qualitative Reasoning, August, Evanston, Illinois, USA.
[13] Mann, W. C. and Thompson, S. A. (1988). Rhetorical structure theory: toward a functional theory of text organization. Text, 8(3), pp 243-281.
[14] Marcu, D. & Echihabi, A. (2002). An unsupervised approach to recognizing discourse relations. ACL 2002.
[15] Mizuta, Y. & Collier N. (2004). An Annotation Scheme for Rhetorical Analysis of Biology Articles. Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). Lisbon, Portugal.
[16] Rzhetsky, A. et al (2004). GeneWays: a system for extracting, analyzing, visualising and integrating molecular pathway data. Journal of Biomedical Informatics, vol 37, pp. 43-53.
[17] Shatkay, H. and Feldman R. (2003). Mining the Biomedical Literature in the Genomic Era: An Overview. Journal of Computational Biology, 10 (3) 821-855.