B.3.1. J.F. Aldana Montes, R. Berlanga

advertisement
A Pilot Study Experience on Generating
Complex Ontology Instances from
Scientific Bibliographies on Real
Biological Domains
José F. Aldana-Montes1, Rafael Berlanga-Llavorí2, Roxana Danger2, Raul MontañésMartínez3, Mª del Mar Rojano-Muñoz1, Francisca Sánchez-Jiménez3
1
Khaos Research Group. Dpt. of Computer Languages and Computing Science,
University of Malaga, Campus de Teatinos. 29071 Malaga, Spain
http://Khaos.uma.es, E-mail: jfam@lcc.uma.es
2 Temporal
Knowledge Bases Group. Dpt. of Computer Languages and Computer
Systems, University Jaume I, Castellón, Spain
http://www3.uji.es/~berlanga/; E-mail: {berlanga, Roxana.Danger}@uji.es
3
ProCel Lab. Department of Molecular Biology and Biochemistry. Faculty of
Sciences, University of Málaga, Campus de Teatinos. 29071 Malaga, Spain
http://www.bmbq.uma.es/procel, E-mail: kika@uma.es
Abstract
In this paper we present a first insight into the generation of complex ontology instances
from scientific bibliographies like the one in PubMed/PubChem on a real biological
domain: Polyamines and Histamine. There is evidence for the involvement of both in
cancer and other inflammation- and/or angiogenesis-dependent diseases, but multiple
questions concerning the molecular processes behind these effects still remain to be
solved.
We address the problem of automatically generating ontology instances starting from a
collection of PDF documents stored in a bibliographic database. We have developed a
domain ontology which models and describes what we are searching for. The structure
of the document is extracted in order to generate a mapping between the ontology and
the document text. Using this mapping the ontology is populated with the extracted
knowledge.
1. Introduction
The Semantic Web (SW) is intended to enrich the current Web by creating
knowledge objects that allow both users and programs to better exploit the Web
resources [1]. The cornerstone of the Semantic Web is the definition of an ontology
that conceptualizes the knowledge within the resources of a particular domain [2].
Thus, the contents of the Web will be described in the future by means of a large
collection of semantically tagged resources that must be reliable and meaningful.
Nowadays most of the information on the Web consists of text documents with little
or no structure at all, which makes their manual annotation impracticable [3].
In automatic document classification several methods have been proposed to
automatically classify documents according to a given taxonomy of concepts (e.g.
[4]). However, these methods require a large amount of training data, and they do
not generate instances of the ontology. In the Information Extraction (IE) area, much
work has been focused on recognizing predefined entities (e.g. dates, locations,
names, etc.), as well as extracting relevant relations between them by using natural
language processing. However, current IE systems are restricted to extracting flat
and very simple relations, mainly to feed a relational database. Ontology
relationships are more complex as they can involve nested concepts and they have
associated inference rules.
For these reasons we address the problem of annotating texts according to an
ontology as a mapping problem between text fragments and ontology subgraphs. As
in IE systems, we also recognize named entities (e.g. protein names) that can be
directly associated to ontology concepts. However, unlike IE systems, we do not use
either syntactic or semantic analysis to extract the relations appearing in the text. As
a result, we can extract complex instances efficiently and effectively.
In this paper, we will apply this technique to extract instances from the biogenic
amines domain and all the related processes.
2. Generation of ontology instances
In this section we introduce the main formal aspects of the proposed extraction
technique presented in [2]. We adopt the formal definition of ontology from [6]. This
definition distinguishes between the conceptual schema of the ontology, which
consists of a set of concepts and their relationships, and its associated resource
descriptions, which are called instances. In our approach, we represent both parts
as oriented graphs, over which the ontology inference rules are applied to extract
and validate complex instances.
Table 1 presents the definitions of the concepts used when generating partial and
complete instances from words and entities appearing within a text fragment.
Definition
Formal description
Description
Def. 1.
- The set of the ontology concepts C is the subset of nodes in N that are not pointed by any arc
labeled as “instance_of”.
An ontology is a labeled oriented
graph G = (N, A) where the set of
nodes N contains both the concept
names and the instance identifiers,
and A is a set of labeled arcs
between them representing their
relationships.
-The taxonomy of concepts C consists of the subgraph that only contains the is_a arcs. It defines a
partial order over the elements of C, denoted with C.
-The relation signature is a function : R  C x C, which contains the subsets of arc labels that
involve only concepts.
-The function dom: R  C with dom(r) = 1((r)) gives the domain of a relation rR, whereas
range: R  C with range(r) = 2((r)) gives its range.
-The function domC: C  2 gives the domain of definition of a concept.
Additionally, over the graph we
introduce the following restrictions
according to [6]:
-The function in order to express the proximity of two concepts c and c’ where dT(c, c’) and dR(c,
c’) are the distances between the two concepts c and c’.
Rel (c, c' )  1- log C min {d T (c,c' ), d R (c,c' ) }
Def. 2.
We denote with
I
c
o
an instance related to the object o of the class c or simply an instance, as the
subgraph of the ontology that relates the node o with any other, and there exists an arc (o, instance_of,
c)  A.
I o  { (o, r , o' ) /(o, instance_of, c )  A, r  R, (r )  (c * , c' ),
c
o  domc (c * ), o' domc (c' ), c *  C c,  r ' , c *  C dom(r ' )  C c}
Notice that the arc (o, instance_of,
c) is not included in the instance
subgraph. An instance is empty if
the object o only participates in the
arc (o, instance_of, c), which is
represented as the set {(o, *, *)}
Def. 3.
c
I
We call a specialization of the instance
o
c c '
I o o' , the following
to the class c’, ccc’, denoted with
instance:
c'
c
I o '  { (o' , r , x) /(o, r , x)  I o , r ' R, r  R r ' , dom(r ) C c' , range(r )  cx ,
.
represent an overridden property
x  domC (cx ), o' domC (c' )}
Def. 4.
I
We call an abstraction of the instance
c'
c
o
Similarly to the previous definition,
we can obtain an abstract instance
by selecting all the triples whose
relation name can be abstracted to a
relation of the target class c’ and
renaming accordingly the instance
object.
cc'
to the class c’, c’cc, denoted with
I o  o ' , the instance
c
I o'  {(o' , r, x) /(o, r, x)  I o , r, cx , dom(r ) C c' , range(r )  cx , x  dom(cx )}
Def. 5:
c
c'
c
We call union of the instances I1 and I 2 and denote it by I1
o'
o
o
c
ˆ I2
I1 o 
c'
o'
I c  I
 1 coc ' 2
 I 1 o o '

c
  I1 o
c'

 I 2 o'
 c ' c
 I 2 o 'o
c'
Def. 6:
We call difference of the instances
c
I 1 o ˆ I 2
c'
o'
c
c
c 'c

I  I
  1 oc 2 oc''o c

 I 1 o  I 2 o 'o
c
I1 o
ˆ I 2 c ' the set:

o'
I 1 o  {( o,*,*)}  I 2
if
o'
c'
c
o'
 (o' ,*,*)  (c  c c'  c'  c)
I2
if
I2
if
I 1 o  {( o,*,*)}  c  c c'
if
I 1 o  {( o' ,*,*)}  c'  c c
and
c'
, denoted with
o'
I2
c  c c'
if
c'  c c
We call symmetric difference of the instances
 {( o' ,*,*)}  c'  c c
c
if
Def. 7:
o'
o'
c
Definitions 5-7 are necessary to
define
the
unification
and
aggregation of simple instances
into complex ones.
 {( o' ,*,*)}  c  c c'
if
c'
Basically, this definition says that
an instance can be specialized by
simply renaming the object with a
name from the target class in all the
instance triples that do not
c
c
c'
I1 o ˆ I 2 o'
, the following set:
c
c'
c'
I1 o and I 2 o ' , denoted with I1 o ˆ I 2 o' the result of
c
c'
ˆ I 2 c' ˆ I1 c .
I1 o ˆ I 2 o' 
o'
o
Def. 8:
We say that two instances
c
I1 o
and
c'
I 2 o'
related to the objects o and o’ of classes c and c’,
respectively, c C c’, are complementary if they satisfy at least one of the following two conditions:
1.
c c'
relations
x  x’, r is biyective) and
((o, r, x), (o’, r’, x’) I1
c'
c
ˆ I 2 , r R r’, r  r’, x  x’) or

o
o'
at least one of the instances
I1 o
c
2.
c'
I1 o o'  I 2 o' ,
((o’, r, x), (o’, r, x’) 
I 2 o'
is empty.
c'
c
say
c'
that
c c '
two
c 'c '
ˆ I2
I o  I1 o o 
o 'o
u
complementary
u
Def. 10:
We say that two instances,
.
c'
or
Def. 9:
We
Two instances are complementary
if contradictions do not exist
between the values of their similar
instances
I1 o
and
I 2 o' are
unifiable
in
, c C c’.
u
c
c'
I1 o and I 2 o'
c*
are aggregable in
c c *
I o  I1 o o  {(o, r , o' ' )} , if
rR, (r) = (c*, c’’), c C c*, c’’C c’, and o’’ is the name of the instance
c ''
c 'c ''
I o ''  I 2 o 'o '' .
Tabla 1. Definitions of the extraction method taken from [2]
Taking into account the definitions above, the system generates the set of instances
that are associated to each text fragment as follows. Firstly, it constructs all the
possible partial instances with the concept, entity and relation names that appear in
the text fragment. Then it tries to unify partial instances, substituting them by the
unified ones. Finally, instances are aggregated to form complex instances. The
following algorithm sums up the process. Further details of this process are given in
[2].
3. Generating Complex Ontology Instances from Scientific Bibliographics on the
Domain of the Biogenic Amines
3.1 The Domain of the Biogenic Amines Metabolism
From a molecular point of view, Biology is an informational science with three major
types of interconnected information: i) the structure of genes, along with the threedimensional structures of proteins, ii) the molecular machines of life, and iii)
biological systems with their emergent behaviour. Validation of new
biocomputational tools for biological data mining and integration requires experience
on a wide body of biological knowledge, which is essential: i) to select a proper and
interesting problem, ii) to build the underlying ontology, iii) to purge any automatic
searching result, and to build new applications (i.e.: hypothesis and conclusions) on
the obtained data. In this project, we have tried to accomplish all of the above
mentioned requirements to build a tool for location of molecular information by text
mining in both web pages and PDF documents. Our group combines experts in
computer science and molecular biology, involved in interdisciplinary projects for the
last 4 years. As a pilot topic for validation, we chose a module of the secondary
metabolism of living cells "the biogenic amine metabolism".
Biogenic amines are derived from amino acids. Polyamines play a relevant
regulatory role in macromolecular synthesis and cell proliferation rates [7-9]. On the
other hand, histamine is a biogenic amine synthesized by the enzyme histidine
decarboxylase (HDC) with a major role in immune response, inflammation, and
other multiple physiological activities [10-11]. All together biogenic amines are
involved in many of the most important and general diseases, such as neurological
disorders, cancer, and immune pathologies. Traditionally, histamine and polyamines
have been the subject of much interest by many biomedical research groups without
much communication among them. However, they share similar chemical and
metabolic facts and physiopathological missions [8-11]. Nevertheless, multiple
questions concerning the molecular processes behind these effects remain to be
solved. This may be due to the great dispersion of available data in specialized
literature of very different areas of biomedicine. For all of these reasons, this field
seems to be worthy of selection as an interesting conceptual benchmark. In silico
predictive models have been developed and published by the group at different
levels that help to explain many of the previous experimental observations, giving
rise to new findings and hypothesis [10,12,13]. Development of new text mining
tools is essential for efficiency in the progression of these projects.
3.2 Domain Ontology
Enzymes (mainly proteins) are the catalysts of the biological reactions that take part
in any metabolic pathway. The half-life of a given enzyme is determined by the
balance between its synthesis and degradation rates. Protein degradation involves
many different mechanisms (proteinase activities) specific for each enzyme. Many of
the regulatory mechanisms of amine metabolic pathways involved enzyme
destruction after the biological mission of their respective amine has been carried
out, as a mechanism to ensure an effective switch-off of the process. In addition,
several amine-related enzymes are synthetized in an unactive form that needs to be
cleavaged by proteinases to reach its active stage (processing /maturation). This is
the case for mammalian histidine decarboxylase, a very unstable enzyme that
requires a proteolytic mechanism for both maturation and regulation of its turnover.
A lot of information on the enzyme degradation/processing mechanisms involved in
amine metabolism has been generated in different text formats. An important
percentage of this information has been generated by collaborator groups and our
own group, especially in the case of mammalian histidine decarboxylase [14-17].
This seems to be a good test bank for our pilot project on text mining. Thus, we
adopted HDC as the pilot molecule to contrast our data mining efforts to develop an
automatized searching tool on published pdfs and other document formats.
Nevertheless, due to the highly consistent nature of protein metabolism, once the
data mining process is validated on this enzyme, it could be easily scaled-up to any
other enzyme, related or not to amine metabolism. This ontology has been
developed in OWL [18].
Fig. 2. Ontology
3.3 Benchmark
The benchmark has been chosen from an extensive pool of scientific works coming
from a world-wide group of experts on degradation and maturation of histidine
decarboxylase (HDC). In this benchmark 67 documents are compiled in PDF format
and eight web pages, 50 percent of them conform with the pertinent information
(highly-related to HDC degradation/maturation). Both collateral (related to other facts
of histamine metabolism) and unrelated information which were included as controls,
are essential for a correct and accurate validation of both the analyzers and the
ontology during this pilot experience. Document selection was done on the basis of
the impact and availability of the publication source. Selected documents content
dispersed (and in some cases contradictory) experimental results developed in the
field of the molecular biology and biochemistry.
Text description and figure of instances
<Sample id="004" source="22.pdf" section="Title and Abstract">
<P>Expression of 74-kDa histidine decarboxylase protein in a macrophage-like
cell line RAW 264.7 and inhibition by dexamethasone</P>
Total words
Instances
Correct Instances
Failed Instances
17
5
4
1
<Sample id="010" source="9.pdf" section="2.1.1">
<P>
The human H1 receptor contains 486, 488 or 487 amino acids in rat, mouse and
human respectively. It shares the typical features of GPCR, namely: seven
transmembrane domains of 20-25 amino-acids predicted to form an a-helice that
spans the plasma membrane and an extracellular NH2 terminal (polypeptide)
domain with glycosylation site. It is encoded by a single exon gene located on
the distal short arm of chromosome 3p25 in humans and chromosome 6 in mice.
Histamine binds to aspartate residues in the transmembrane domain (motif) 3 of
the receptor and to asparagine + lysine residues within the transmembrane
domain 5. H1 receptors are involved in the pathological process of allergy such
as allergenic rhinitis, conjunctivitis, atopic dermatitis, urticaria, asthma
and anaphylaxis. In the lung, it mediates bronchoconstriction and increased
vascular permeability. The H1R is expressed in numerous cell types, including
airway and vascular smooth muscle cells, hepatocytes, chondrocytes, nerve
cells, endothelial cells, neutrophils, monocytes, dendritic cells, as well as T
and B lymphocytes, in which it mediates the various biological manifestations
of allergic responses [3-5]. H1R is a Gaq/11-coupled protein (polypeptide)
with a very large third intracellular loop and a relatively short C-terminal
(polypeptide) tail (Fig. 1). The main signal induced by ligand binding is the
activation of phospholipase C-generating inositol 1,4,5-triphosphate (Ins
(1,4,5) P3) and 1,2-diacylglycerol leading to increased cytosolic (cellular
lication) Ca2+ [4]. This rise in intra-cellular calcium levels seems to account
for the various pharmacological activities promoted by the receptor, such as
nitric oxide production, vasodilatation, liberation of arachidonic acid from
phospholipids and increased cyclic AMP.H1R also activates NF kB through Gaq11
and Gbg upon agonist binding, while constitutive activation of NF kB occurs
through Gbg only [6].
</P>
Total words
Instances
Correct Instances
Failed Instances
283
14
10
4
Table 2. Representation of total correct instances (orange diamond in figure) of text and the fails
instances (gray diamond).
Concepts, links and instances were defined on the basis of scientific criteria without
taking into account the precise benchmark-containing information or semantic style.
One of the major difficulties detected throughout the work was the semantic
dispersion in the expression of concepts and results in the biochemical information,
which make the accurate definition of the concepts and instances for the mining
difficult, and obliged to the establishment of an enriched list of synonyms. Working
in this proving bank we obtain a set of results as shown in the examples in table 2.
The results indicate that concepts contained in the text are properly detected;
although the system fails to detect the appropriate context, the results are not
discouraging at all; however, further efforts will be necessary to restrict possibilities
to establish long-distance relationships (with respect to the text position of the
concepts) and to solve problems with ambiguity of the biological terms (for instance,
in literature, H1 could mean location of a histidine residue at position 1 as well as
the abbreviation for a histamine receptor sybtype). These changes would lead to
increase the astringency of the search criteria, and consequently to the loss of some
information, although accuracy should be improved
The extraction algorithm captures the main relationships between correctly detected
entities, but it mainly fails in the correct identification of some basic entities. This is
principally because we have used preliminary regular expressions to detect named
entities instead of a good thesaurus. Another problem we identify is the correct
tokenization of text fragments due to the large number of chemical formula and
abbreviations that are included in biomedical texts. Future work must then focus on
improving the most basic mechanisms for detecting named entities and concept
boundaries in the texts. Finally, we have also noticed that the level of detail of the
ontology also affects the quality of extracted instances. Thus, the more detailed the
ontology is, the likelier and better are the relations found in text fragments. In this
respect, much more work must be done to improve the details of the target ontology.
4. Conclusion
We have presented a pilot project on the generation of complex ontology instances
in the very real (and thus hard) domain of the biogenic amines metabolism (BAM).
An ontology which modelizes the discussion domain of the BAM has been
developed in OWL. We have built a benchmark from the real bibliography of the
BAM domain (extracted from several public domain bibliographic databases) and
we are going to use this pilot experience to evaluate in a well-known controlled
environment the quality of the results produced by our system. Firsts results are not
absolutely discouraging but further work must be done in the description of the
ontology.
Acknowledgements
This work was supported by Grants CVI-267, CVI-657, TIN-09098-C05-01 and
TIN2005-09098-C05-04.
References
[1] Berners-Lee, T., Hendler, J., Lassila, O. "The Semantic Web". Scientific American, 2001.
[2] Danger, R., Sanz, I., Berlanga-Llavori, R., Ruiz-Shulcloper, J. “A Proposal for the Automatic
Generation of Instances from Unstructured Text”, Lecture Notes in Computer Science, Volume 3287,
pp. 462 – 469, 2004.
[3] Gruber, T.R. “Towards Principles for the Design of Ontologies used for Knowledge Sharing”,
International Journal of Human-Computer Studies Vol. 43, pp. 907-928, 1995.
[4] Forno, F., Farinetti, L., Mehan, S. “Can Data Mining Techniques Ease The Semantic Tagging
Burden?”, SWDB 2003, pp. 277-292, 2003.
[5] Doan, A. et al. “Learning to match ontologies on the Semantic Web”. VLDB Journal 12(4), pp. 303319, 2003.
[6] Maedche, A., Neumann, G. and Staab, S. “Bootstrapping an Ontology based Information Extraction
System”. Studies in Fuzziness and Soft Computing, Springer, 2001.
[7] Cohen SS. Aguide to polyamines. Oxford University Press, New York; 1998.
[8] Thomas T, Thomas TJ. Polyamines in cell growth and cell death: molecular mechanisms and
therapeutic applications. Cell Mol Life Sci. 2001; 58: 244–58.
[9] Medina MA, Urdiales JL, Rodriguez-Caso C, Ramirez FJ, Sanchez-Jimenez F. Biogenic amines and
polyamines: similar biochemistry for different physiologi cal missions and biomedical applications. Crit
Rev Biochem Mol Biol. 2003; 38: 23–59.
[10] Moya-Garcia AA, Medina MA, Sanchez-Jimenez F. Mammalian histidine decarboxylase: from
structure to function. Bioessays 2005; 27: 57–63.
[11] Medina MA, Quesada AR, Nunez de Castro I, Sanchez-Jimenez F. Histamine, polyamines, and
cancer. Biochem Pharmacol. 1999; 57: 1341–4.
[12] Medina M.A. Correa-Fiz F., Rodríguez-Caso C., Sánchez-Jiménez F. A comprehensive view of
polyamine and histamine metabolism to the light of new technologies. J. Cell. Mol. Med. 2005. Vol. 9.
854-864.
[13] Rodríguez-Caso C., Montañez R., Cascante M., Sánchez-Jiménez F., Medina M.A. Mathematical
modeling of polyamine metabolism in mammals. J. Biol. Chem. en prensa (Mayo,06), on-line PMID:
16709566.
[14] Ángel N., Olmo M.T., Coleman C.S., Medina M. A., Pegg A.E., Sánchez-Jiménez F. Experimental
evidence for structure/activity features in common between mammalian. Biochem. J. 1996. Vol. 320.
365-368.
[15] Matés J.M., del Valle A.E., Urdiales J.L., Coleman C.S., Feith D., Pegg A.E., Sánchez-Jiménez
F.Structure/function relationship studies on the T/S residues 173-177 of rat ODC. Biochem. Biophys.
Acta 1998. Vol. 1386. 113-120.
[16] Olmo M.T. Urdiales J.L., Pegg A.E., Medina M.A., Sánchez-Jiménez F.. Eur. J. Biochem. 2000. Vol.
267. 1527-1531.
[17] Olmo M.T., Sánchez-Jiménez F., Medina M.A., Hayashi H. Spectroscopic análisis of recombinant rat
histidina decarboxylase. 2002. Vol.132. 433-439.
[18] OWL Web Ontology Language 1.0 Reference. W3C Candidate Recommendation 15 Decembre 2003
http://www.w3.org/TR/owl-ref.
Download