GRAMMARS FOR QUESTION ANSWERING SYSTEMS BASED ON INTELLIGENT TEXT MINING IN BIOMEDICINE

JOHN KONTOS, JOHN LEKAKIS, IOANNA MALAGARDI and JOHN PEROS
Abstract--A top-down approach to the determination of grammar engineering methods for the construction of Natural Language Grammars for Question Answering Systems based on Intelligent Text Mining for biomedical applications is presented in this paper. Our proposal is to start from the formulation of the ultimate task goal, such as specialized question answering, and derive from it the specifications of the grammar engineering methods, such as hand crafting or machine learning. The task goal we have experience in involves syllogistic inference with scientific texts and models in biomedicine. An important kind of knowledge that we deal with is causal knowledge, expressed either linguistically or formally. The linguistic expressions we deal with involve causal verbs that connect noun phrases denoting entity-process pairs. An illustrative example of a Text Grammar is given and a pilot experiment towards Memory Based Grammar Induction using TiMBL is described. The results obtained so far render the automatic induction of multilevel grammars problematic.
Index Terms-- Question Answering, Intelligent Text Mining,
Biomedical Texts, Biomedical Models.
There are two possible methodologies for applying deductive reasoning to texts. The first methodology is based on translating the texts into some formal representation of their "content", with which deduction is then performed. The advantage of this methodology is the simplicity and availability of the required inference engine, but its serious disadvantage is the need to preprocess all the texts and to store their translation into a formal representation every time something changes. In particular, in the case of scientific texts what may change is some part of the prerequisite knowledge, such as the ontology used for deducing new knowledge. The second methodology eliminates the need for translating the texts into a formal representation, because an inference engine capable of performing deductions "on the fly", i.e. directly from the original texts, is implemented. The only disadvantage of this second methodology is that a more complex inference engine must be built than the one needed by the first. Its strong advantage is that the translation into a formal representation is avoided altogether. In [4] the second methodology is chosen, and the method developed is therefore called ARISTA, i.e. Automatic Representation Independent Syllogistic Text Analysis.
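The difference can be illustrated with a minimal sketch of the second methodology, in which a toy pattern matcher stands in for a full text grammar; the sentence patterns and function names below are our own illustration, not the actual ARISTA implementation of [4]:

```python
import re

# Toy "corpus": causal statements kept as raw text, never translated
# into a formal representation.
SENTENCES = [
    "hypoxia causes activation of NF-kappaB",
    "activation of NF-kappaB causes transcription of target genes",
]

def match_cause(effect):
    """Scan the raw sentences 'on the fly' for causes of the given effect."""
    for s in SENTENCES:
        m = re.match(r"(.+) causes (.+)", s)
        if m and m.group(2) == effect:
            yield m.group(1)

def explain(effect, depth=0):
    """Back-chain from an effect to its causes, printing a proof trace."""
    found = False
    for cause in match_cause(effect):
        print("  " * depth + f"{cause} causes {effect}")
        explain(cause, depth + 1)
        found = True
    return found

explain("transcription of target genes")
```

If the ontology or any sentence changes, nothing has to be re-translated; the next query simply re-reads the text.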
I. INTRODUCTION
A top-down approach to the determination of grammar
engineering methods for the construction of Natural
Language Grammars for Question Answering Systems
based on Intelligent Text Mining for Biomedical
applications is presented in this paper. Our proposal
is to start with the formulation of the ultimate task goal
such as specialized question answering and derive from it
the specifications of the grammar engineering methods such
as hand crafting or machine learning. The task goal we have
experience in involves syllogistic inference with scientific
texts and models in biomedicine.
We define Intelligent Text Mining as the process of
creating novel knowledge from texts that is not stated in
them by combining sentences with deductive reasoning.
 ARTIFICIAL INTELLIGENCE GROUP
DEPARTMENT OF PHILOSOPHY AND HISTORY OF SCIENCE
NATIONAL AND KAPODISTRIAN UNIVERSITY OF ATHENS
E-mail: ikontos2003@yahoo.com,imal@gsrt.gr
The ultimate task goal of the applications of grammars that
we are interested in is the development of systems for
knowledge extraction and processing with intelligent text
mining of biomedical texts. An important kind of
knowledge that we deal with is causal knowledge expressed
either linguistically or formally. The linguistic expressions
we deal with involve causal verbs that connect noun
phrases that denote entity-process pairs. In certain cases it is hoped that the knowledge obtained in this way leads to novel answers to the questions posed to the system, accompanied by appropriate automatically generated explanations [4], [5], [6], [7].
The ultimate goal of a Question Answering system based
on Intelligent Text Mining is to extract a meaningful
answer from the underlying text sources. Answer extraction
depends on the complexity of the question, on the answer
type provided by question processing, on the actual texts
where the answer is searched, on the search method and on
the question focus and context.
The requested information might be present in a variety of heterogeneous data sources that must be searched by different retrieval methodologies. Several collections of texts might be organized into separate structures, with different index operators and retrieval techniques. Moreover, additional information may be organized in databases and models, or may simply reside on the Web in HTML or XML format. The context may be represented in different formats according to the application, but the search for the answer must be uniform across all sources, whether textual or in other formats, and the answer must be identified uniformly.
To enable this process different indexing and retrieval
techniques for Q&A must be employed and methods of
detecting overlapping information as well as contradictory
information must be developed. Moreover, answers may be
found in existing knowledge bases or ontologies.
During the answer extraction process itself at least three
problems arise. The first one is related to the assessment of
answer correctness, and implicitly to ways of justifying its
correctness. The second problem relates to the method of
knowing whether all the possible correct answers have been
found throughout the text sources. The third problem is that
of generating a coherent answer when its information is
distributed across a document or throughout different
documents. An extension of this problem is the case when
special reasoning must be generated prior to the extraction
of the answer as the simple juxtaposition of the answer
parts is not sufficient but the answer must be derived
deductively.
To justify the correctness of an answer, explanations, in the
most natural format, should be made available.
Explanations may be based on a wide range of information,
varying from evidence enabled by natural language texts to
commonsense and pragmatic knowledge, available from
on-line knowledge bases. A back-chaining mechanism
implemented into a theorem prover may generate the
explanation for an answer in the most complex cases, and
translate it into a form easily and naturally understood. In
the case of textual answers, textual evidence is necessary
for the final justification of the answer.
II. THE TOTAL QUESTION ANSWERING TASK
The AROMA (ARISTA Oriented Modeling Adaptation)
system for Intelligent Text Mining is the implementation of
a representative solution of the total Question Answering
task. AROMA accomplishes knowledge discovery using
our knowledge representation independent method
ARISTA that performs causal reasoning “on the fly”
directly from text in order to extract qualitative or
quantitative model parameters as well as to generate an
explanation of the reasoning. The application of the system
is demonstrated by the use of two biomedical examples
concerning cell apoptosis and compiled from a collection of
MEDLINE paper abstracts related to a recent proposal of a
relevant mathematical model. A basic characteristic of the
behavior of such biological systems is the occurrence of
oscillations for certain values of the parameters of their
equations. The solution of these equations is accomplished
with a Prolog program. This program provides an interface
for manipulation of the model by the user and the
generation of textual descriptions of the model behavior.
This manipulation is based on the comparison of the
experimental data with the graphical and textual simulator
output. The application concerns a mathematical model of the p53-mdm2 feedback system. The dependence of the behavior of the model on its parameters may be studied using our system. The variables of the model correspond to the concentrations of the proteins p53 and mdm2, respectively. The non-linearity of the model causes the
appearance of oscillations that differ from simple sine
waves. The AROMA system for Intelligent Text Mining
consists of three subsystems as briefly described below.
The first subsystem achieves the extraction of knowledge
from individual sentences that is similar to traditional
information extraction from texts. In order to speed up the
whole knowledge acquisition process a search algorithm is
applied on a table of combinations of keywords
characterizing the sentences of the text.
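A minimal sketch of such a keyword-combination table follows; the keywords, sentences, and table layout are our own illustrative assumptions, not the actual data structure of the first subsystem:

```python
# Sketch of a keyword-combination lookup table for sentence selection.
SENTENCES = [
    "Hypoxia causes the activation of NF-kappaB.",
    "Nitric oxide reduces endothelial expression of adhesion molecules.",
]
KEYWORDS = {"hypoxia", "causes", "reduces", "activation", "oxide"}

# Build the table once: sentence index -> set of keywords it contains.
table = {i: frozenset(w.strip(".").lower() for w in s.split()) & KEYWORDS
         for i, s in enumerate(SENTENCES)}

def lookup(*required):
    """Return indices of sentences containing all required keywords."""
    need = set(required)
    return [i for i, kws in table.items() if need <= kws]

lookup("hypoxia", "causes")  # selects only sentences with both keywords
```

Because the table is precomputed, each query avoids re-scanning the full text of every sentence.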
The second subsystem is based on a causal reasoning
process that generates new knowledge by combining “on
the fly” knowledge extracted by the first subsystem. Part of
the knowledge generated is used as parametric input to the
third subsystem and controls the model adaptation.
The third subsystem is based on system modeling that is
used to generate time dependent numerical values that are
compared with experimental data. This subsystem is
intended for simulating in a semi-quantitative way the
dynamics of nonlinear systems. The characteristics of the
model such as structure and parameter polarities of the
equations are extracted automatically from scientific texts
and are combined with prerequisite knowledge such as
ontology and default process and entity knowledge.
The AROMA system may answer a variety of questions, some kinds of which are briefly specified in the following illustrative taxonomy:

Question                     Abstract Specification
1. Causal antecedent         What caused some event to occur? What state or
                             event causally led to an event or state?
2. Causal consequence        What are the consequences of an event or state?
                             What causally unfolds from an event or state?
3. Enablement                What object or resource enables an agent to
                             perform an action?
4. Instrumental/Procedural   What instrument or body part is involved when a
                             process is performed?
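The simulation subsystem mentioned above can be illustrated generically. The sketch below integrates a two-variable negative-feedback system with delay by forward Euler; the equations, parameter values, and initial conditions are our own illustrative assumptions, not the actual p53-mdm2 model adapted by AROMA:

```python
# Generic two-variable negative-feedback loop with delay (forward Euler).
# Illustrative only: NOT the authors' actual p53-mdm2 equations.
def simulate(steps=5000, dt=0.01, delay=2.0,
             a=1.0, b=1.5, c=1.0, d=0.8):
    lag = int(delay / dt)              # delay expressed in Euler steps
    p, m = 1.0, 0.5                    # arbitrary initial concentrations
    m_hist = [m] * (lag + 1)           # buffer of past "mdm2" values
    traj = []
    for _ in range(steps):
        m_delayed = m_hist[-lag - 1] if lag else m
        dp = a - b * m_delayed * p     # delayed negative feedback on p
        dm = c * p - d * m             # p activates m; m decays
        p += dt * dp
        m += dt * dm
        m_hist.append(m)
        if len(m_hist) > lag + 1:
            m_hist.pop(0)
        traj.append((p, m))
    return traj

traj = simulate()
```

Depending on the parameter values, such a loop can produce sustained or damped oscillations, which is the qualitative behavior whose dependence on text-extracted parameters the third subsystem studies.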
III. CAUSAL KNOWLEDGE IN NATURAL LANGUAGE TEXTS
An important kind of knowledge that we deal with is causal
knowledge expressed either linguistically or formally. The
linguistic expressions we deal with involve causal verbs
that connect noun phrases that denote entity-process pairs.
The texts which we are interested in analyzing are found, when studied, to contain causal knowledge that appears in the form of information delivered both "directly" and "indirectly". Sentences delivering causal knowledge indirectly are in most cases embedded sentences, the main verb being a metascience operator (e.g. discovered, observed, has shown how, found that, demonstrated that, argued that, explained how, etc.).
By "causal knowledge", thus, we mean any form of
knowledge relating to the interaction of objects/entities and
events or processes in some model, i.e. knowledge about
how processes related to certain entities result from or lead
to processes associated with other or the same entities. This
knowledge is mainly expressed in the form of causal
relations. In terms of the "state" - "event"(/"process") pair,
the types of causal links dealt with in the example text are
described as events/processes causing events/processes or
conditions (states or events) allowing some causal link.
In natural language, "causality" may be expressed in a
variety of linguistic forms. The linguistic forms expressing
the parts of causal relations involve, among others, clausal
adverbial subordination, clausal adjectival subordination,
and discourse connection. A causal relation, typically, is a
pair consisting of an "antecedent" and a "consequent".
Antecedent and consequent may appear in two connected
sentences or in terms of some causal implication. In
passivised sentences the by-phrase introduces the
antecedent of the causal relation. Antecedents of causal
relations may also be identified in by-phrases associated
with past participles.
Some verbs and participles may function as mere causal connectives (e.g. cause), in contrast to others (e.g. increase) which may have some additional meaning. Processes,
which function as antecedents or consequents of some
causal relation, can also be referred to in terms of
nominalizations in causal-relation-delivering sentences. In
dealing with sentential constituents -rather than complete
sentences- which may deliver one of the parts of some
causal relation, we should note that adverbial modification
may also provide material qualifying for such
antecedents/consequents. Simple causal relations acquired
from texts may be recognized by means of a four-argument
predicate with the arguments:
(a) the consequent process related to some entity in the
physical system
(b) the entity in the physical system involved in the
consequent
(c) the antecedent process related to a similar (or the same)
entity
(d) the entity involved in the antecedent.
The automatic extraction of the appropriate terms which
will fill in the process-and-entity slots in the "cause"
predicate involves the following steps:
First, phrases qualifying for antecedents/consequents are
identified. Due to the variety of linguistic forms encoding
causal relations, as indicated above, there may be more than
one causal relation encoded in one sentence.
Second, the terms representing each process-entity structure
of the antecedent and consequent are identified.
Third, after the identification of a process the entity which
undergoes this process must fill in the entity-slot of the
same antecedent/consequent "process-entity" structure. In
cases where there are two entity nouns, in a phrase,
qualifying for some antecedent/consequent, a choice is
made between these entity nouns on the basis of some
additional features.
There is a further point to be made concerning the identification of the entity noun which is to fill in an entity-slot, namely, the recovery of a noun related to some process, when the former is absent from the text. The
linguistic items constituting the entity-process structures, in
complex causal relations, have to be recovered from a
number of complex linguistic expressions used in the text.
Some of the complex linguistic phenomena which may be
involved in the recovery of such items are anaphora and
deixis, noun modification, coordination, clausal
subordination such as relative, temporal, conditional and
concessive clauses.
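The four-argument "cause" predicate described above can be sketched as a record, together with the transitive chaining that syllogistic inference performs over such facts; the field names and the facts themselves are illustrative, not output of the actual extractor:

```python
from typing import NamedTuple

# The four-argument "cause" predicate of the text, as a record:
# cause(ConsequentProcess, ConsequentEntity, AntecedentProcess, AntecedentEntity)
class Cause(NamedTuple):
    cons_process: str
    cons_entity: str
    ant_process: str
    ant_entity: str

# Facts as they might be filled in from causal sentences (our wording).
FACTS = [
    Cause("activation", "NF-kappaB", "phosphorylation", "IkappaB-alpha"),
    Cause("transcription", "target genes", "activation", "NF-kappaB"),
]

def chain(process, entity):
    """Follow antecedents of a (process, entity) pair transitively."""
    out = []
    for f in FACTS:
        if (f.cons_process, f.cons_entity) == (process, entity):
            out.append((f.ant_process, f.ant_entity))
            out.extend(chain(f.ant_process, f.ant_entity))
    return out

chain("transcription", "target genes")  # walks back to the ultimate antecedent
```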
IV. AN ILLUSTRATIVE EXAMPLE OF A GRAMMAR
An illustrative example of a Text Grammar is given below in a BNF-like form, interleaved with POS-tagged corpus sentences that illustrate the strings the grammar must cover.

Sentence Structure
s :- np cc np.

Noun Phrase Structures
np :- det np | q np | adj np | det adj np.
np :- np pp | pnp pp.
pnp :- d prn.

Prepositional Phrase Structures
pp :- prep np | prep pp.

Process Entity Pairs
pe :- np prn agp | prs ens | prs def.

Entity Process Pairs
ep :- ens prs | np.

Aspirin/NN inhibits/VBZ nuclear/JJ factor_kappa/NN B/NN mobilization/NN and/CC monocyte/NN adhesion/NN in/IN stimulated/VBN human/JJ endothelial/JJ cells/NNS ./.

Process Strings
prs :- vb prs | adj proc | prn pp | proc np an np an | q np adj prn | np adj prn | np perm prs | prvn np perm | prvn det prn | prn prn | det prn | prvn(P).

The/DT antioxidant/JJ pyrrolidine/NN dithiocarbamate/NN (/( PDTC/NN )/) specifically/RB inhibits/VBZ activation/NN of/IN nuclear/JJ factor_kappa/NN B/NN (/( NF_kappa/NN B/NN )/) ./.

PREVENT
prvn :- proc(X) | prn(X).

Entity Strings
ens :- pp | np | prep np | adj noun.

Causal Connectors
cc :- "causes" | "induces" | "inhibits" | "as" | copula "caused by" | "forces" | "opposes".

A/DT specific/JJ inhibitor/NN of/IN caspases/NNS ,/, zVAD_FMK/NN ,/, prevents/VBZ triptolide_induced/JJ PARP/NN degradation/NN and/CC DNA/NN fragmentation/NN but/CC not/RB growth/NN arrest/NN ./.
Mutation/NN of/IN the/DT core/NN ETS/NN binding/NN site/NN from/IN _GGAA_/NN to/TO _GGAT_/NN prevents/VBZ the/DT binding/NN of/IN ETS_like/JJ factors/NNS with/IN the/DT exception/NN of/IN ETS1/NN ./.
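A minimal sketch of the top-level rule s :- np cc np is the causal-connector split below; it handles only the single-word connectors and does no real noun-phrase parsing, so it illustrates the rule rather than implementing the full grammar:

```python
# Single-word causal connectors from the cc rule; the multiword forms
# ("caused by") and the ambiguous "as" are omitted in this sketch.
CAUSAL_CONNECTORS = {"causes", "induces", "inhibits", "forces", "opposes",
                     "prevents", "produces"}

def split_causal(sentence):
    """Return (antecedent_np, connector, consequent_np) or None.

    Implements only the top-level split of "s :- np cc np": both sides
    must be non-empty for the sentence to be accepted.
    """
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w in CAUSAL_CONNECTORS and 0 < i < len(words) - 1:
            return (" ".join(words[:i]), w, " ".join(words[i + 1:]))
    return None

split_causal("Hypoxia causes the activation of NF-kappaB.")
```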
V. A PILOT EXPERIMENT TOWARDS MEMORY BASED GRAMMAR INDUCTION
A pilot experiment towards Memory Based Grammar Induction using the TiMBL system [2] is described. The corpus used is the Genia corpus [3], used by the BioCreAtIvE European competition. Causal sentences containing causal verbs were collected as positive examples. The causal verbs used are verbs like cause, induce, inhibit, etc. Illustrative causal sentences containing such verbs are listed below, grouped by verb, in the POS-tagged form found in the Genia corpus.

PRODUCE
Dexamethasone/NN produced/VBD a/DT rapid/JJ and/CC sustained/JJ increase/NN in/IN glucocorticoid/NN response/NN element/NN binding/NN and/CC a/DT concomitant/JJ 40_50/CD %/NN decrease/NN in/IN AP_1/NN ,/, NF/NN kappa/NN B/NN ,/, and/CC CREB/NN DNA/NN binding/NN that/WDT was/VBD blocked/VBN by/IN combined/JJ dexamethasone/NN and/CC cytokine/NN or/CC PMA/NN treatment/NN ./.
Targeted/VBN remodeling/NN of/IN human/JJ beta_globin/NN promoter/NN chromatin/NN structure/NN produces/VBZ increased/VBN expression/NN and/CC decreased/VBN silencing/NN ./.

CAUSE
TI/LS _/: Hypoxia/NN causes/VBZ the/DT activation/NN of/IN nuclear/JJ factor/NN kappa/NN B/NN through/IN the/DT phosphorylation/NN of/IN I/NN kappa/NN B/NN alpha/NN on/IN tyrosine/NN residues/NNS ./.
TI/LS _/: E2F_1/NN blocks/VBZ terminal/JJ differentiation/NN and/CC causes/VBZ proliferation/NN in/IN transgenic/JJ megakaryocytes/NNS ./.

REDUCE
Nitric/JJ oxide/NN selectively/RB reduces/VBZ endothelial/JJ expression/NN of/IN adhesion/NN molecules/NNS and/CC proinflammatory/JJ cytokines/NNS ./.
INDUCE
IL_4/NN induces/VBZ Cepsilon/NN germline/NN transcripts/NNS and/CC IgE/NN isotype/NN switching/NN in/IN B/NN cells/NNS from/IN patients/NNS with/IN gammac/NN chain/NN deficiency/NN ./.
Binding/NN of/IN activated/VBN platelets/NNS induces/VBZ IL_1/NN beta/NN ,/, IL_8/NN ,/, and/CC MCP_1/NN in/IN leukocytes/NNS ./.

INHIBIT
The/DT increase/NN of/IN cortisol/NN is/VBZ immunosuppressive/JJ and/CC reduces/VBZ GR/NNS concentration/NN both/CC in/IN nervous/JJ and/CC immune/JJ systems/NNS ./.

PROMOTE
Ligand_induced/JJ tyrosine/NN phosphorylation/NN of/IN the/DT STATs/NNS promotes/VBZ their/PRP$ homodimer/NN and/CC heterodimer/NN formation/NN and/CC subsequent/JJ nuclear/JJ translocation/NN ./.
Tax/NN expression/NN promotes/VBZ N_terminal/JJ phosphorylation/NN and/CC degradation/NN of/IN IkappaB/NN alpha/NN ,/, a/DT principal/JJ cytoplasmic/JJ inhibitor/NN of/IN NF_kappaB/NN ./.

STIMULATE
IL_5/NN also/RB stimulates/VBZ GTP/NN binding/VBG to/TO p21ras/NN ./.
Overexpression/NN of/IN protein/NN kinase/NN C_zeta/NN stimulates/VBZ leukemic/JJ cell/NN differentiation/NN ./.
AB/LS _/: Interleukin_7/NN (/( IL_7/NN )/) stimulates/VBZ the/DT proliferation/NN of/IN normal/JJ and/CC leukemic/JJ B/NN and/CC T/NN cell/NN precursors/NNS and/CC T/NN lymphocytes/NNS ./.
One hundred sentences like the above were used as the set of positive examples and one hundred non-causal sentences were used as the set of negative examples. The influence of the number of words of the sentences taken into account for learning on the confusion matrix was studied. Some indicative results are shown below (Y = causal, N = non-causal; confusion matrix rows give the true class in the order Y, N).

Words  Accuracy             Exact matches  Ties (correctly resolved)  Confusion matrix
5      150/200 (0.750000)   25             3 (3, 100.00%)             Y: 84 16 / N: 34 66
6      146/200 (0.730000)   52             17 (5, 29.41%)             Y: 78 22 / N: 32 68
7      158/200 (0.790000)   41             13 (7, 53.85%)             Y: 86 14 / N: 28 72
8      149/200 (0.745000)   27             9 (5, 55.56%)              Y: 83 17 / N: 34 66
9      150/200 (0.750000)   27             5 (2, 40.00%)              Y: 84 16 / N: 34 66
10     154/200 (0.770000)   78             13 (5, 38.46%)             Y: 82 18 / N: 28 72

A graphical representation of some results is given in Figs. 1 and 2.

[Figure 1: Causal Sentences -- results (%) for sentence lengths from 5 to 27 tags/line]
[Figure 2: Non-Causal Sentences -- results (%) for sentence lengths from 5 to 27 tags/line]
The general trend of the results was that the best performance was obtained when about 7 words were taken into account. This was not very promising, since the corpus sentences had a maximum length of about 30 words. The reasons for this phenomenon are being investigated. Another important observation was that among the 100 positive examples very few patterns repeated themselves. This means that the automatic induction of multilevel grammars [1] would be rather problematic.
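The memory-based classification used in the experiment can be imitated in a few lines: TiMBL's IB1 algorithm with the overlap metric amounts to nearest-neighbour search over fixed-length feature vectors, here the first N word/POS tokens of each sentence. The tiny training set below is invented for illustration and does not reproduce the reported figures:

```python
# Imitation of memory-based (IB1-style) classification with the overlap
# metric. Features: the first N tokens of a POS-tagged sentence.
N = 7

def features(tagged, n=N):
    toks = tagged.split()
    return tuple(toks[:n]) + ("_",) * max(0, n - len(toks))  # pad short ones

# Invented mini training set: causal (Y) vs. non-causal (N) sentences.
TRAIN = [
    ("Hypoxia/NN causes/VBZ the/DT activation/NN of/IN nuclear/JJ factor/NN", "Y"),
    ("IL_4/NN induces/VBZ Cepsilon/NN germline/NN transcripts/NNS and/CC IgE/NN", "Y"),
    ("The/DT cells/NNS were/VBD cultured/VBN in/IN standard/JJ medium/NN", "N"),
]

def overlap_distance(a, b):
    """Number of mismatching feature positions (the overlap metric)."""
    return sum(x != y for x, y in zip(a, b))

def classify(tagged):
    """Label of the nearest training example (1-NN)."""
    feats = features(tagged)
    return min(TRAIN, key=lambda ex: overlap_distance(features(ex[0]), feats))[1]

classify("Hypoxia/NN causes/VBZ proliferation/NN in/IN transgenic/JJ megakaryocytes/NNS ./.")
```

With so few repeated patterns among the positive examples, each test item tends to be far from every stored example, which is consistent with the modest accuracies observed.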
VI. CONCLUSION
A top-down approach to the determination of grammar engineering methods for the construction of Natural Language Grammars for Question Answering Systems based on Intelligent Text Mining for biomedical applications was presented in this paper. Our proposal is to start from the formulation of the ultimate task goal, such as specialized question answering, and derive from it the specifications of the grammar engineering methods, such as hand crafting or machine learning. An illustrative example of a Text Grammar was given and a pilot experiment towards Memory Based Grammar Induction using TiMBL was described. The results obtained so far render the automatic induction of multilevel grammars problematic.
References
[1] Girju, R. (2003). Automatic Detection of Causal Relations for Question Answering. Proceedings of the ACL-2003 Workshop on Multilingual Summarization and Question Answering.
[2] TiMBL software. http://ilk.uvt.nl/software.html
[3] GENIA corpus. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/
[4] Kontos, J. (1992). ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology, Vol. 34, No. 9, pp. 611-616.
[5] Kontos, J., Elmaoglou, A., Malagardi, I. (2002). ARISTA Causal Knowledge Discovery from Texts. Proceedings of the 5th International Conference DS 2002, Luebeck, Germany, November 2002. Springer Verlag, pp. 348-355.
[6] Kontos, J., Malagardi, I., Peros, J. (2003a). "The AROMA System for Intelligent Text Mining". HERMIS International Journal of Computers, Mathematics and its Applications, Vol. 4, pp. 163-173. LEA.
[7] Kontos, J., Malagardi, I., Peros, J. (2003b). "The Simulation Subsystem of the AROMA System". HERCMA 2003, Proceedings of the 6th Hellenic European Research on Computer Mathematics & its Applications Conference, Athens.