XML representations of pathway data: a comparison
Lena Strömbäck
Department of Computer and Information Science
Linköpings Universitet
S-581 83 Linköping, Sweden
+46 13 28 2324
lestr@ida.liu.se
ABSTRACT
Standardisation and integration of pathway data is currently an
interesting topic within bioinformatics with several consortia, e.g.
SBML, PSI and BioPAX. These groups use or consider XML for
representation of their standards. Furthermore, XML is used by
many of the existing databases containing pathway information
for export and exchange of data. In this paper we compare some
of the XML representations used by the standardisation
committees and existing databases. The contribution of the paper
is the comparison and evaluation of the representations together
with a discussion on implications for information discovery and
integration.
Categories and Subject Descriptors
J.3. [LIFE AND MEDICAL SCIENCES]: Biology and Genetics
General Terms
Standardization, Languages
Keywords
Bioinformatics, data integration, XML, pathways, databases,
SBML, PSI MI
1. INTRODUCTION
Currently, research within biology rapidly generates new
knowledge on how genes, proteins and other substances interact.
Each piece of knowledge generated by an experiment gives one
small piece in the puzzle of how a cell works. However, because
this data is complex, there is a need to search and integrate
large datasets so that the connections between these small pieces
can be discovered, allowing conclusions about how larger parts of
the puzzle are constituted. Today a large
number of databases containing information about protein and
gene interactions are available, for instance KEGG [17] [19],
BIND [2] [15], MINT [24] [31], BioCyc [4], CSNDB [29], DIP [9]
[26], and SPAD [27]. These databases have slightly different
purposes, and thus the dataset within each of them, the
properties stored for each item and the search possibilities
differ between the databases.
It is important for users to be able to combine and
compare information from several of these data sources, as well as
to have tools for advanced searches in these large datasets. The
importance of this is shown by various research directions within
bioinformatics, e.g. integration of databases [10] [21] and creation
of general purpose tools for search and discovery [18] [28].
One key issue here is the possibility to supply data in mutually
comparable formats, which is important both for integration of
datasets and for search and discovery of data [8]. There have been
a number of proposals for such exchange formats, for which [1] and
[23] give an overview and evaluation. These evaluations show that
XML [30] has many advantages for representing pathway
information. This view is also supported by ongoing
standardisation efforts, e.g. SBML [11] [16] and PSI MI [13]
[25], both of which are defined as specialisations of XML.
Standardisation efforts aiming at common terminology and
ontologies for representing pathway data, e.g. BioPAX [5] [6] and
GeneOntology [14], also provide their specifications and
tools in XML. Many of the databases mentioned above provide
their data for export in some variant of XML.
In this work we have compared the data provided in XML for
exchange from four different databases containing pathway
information. The data is provided in the two proposed
standards SBML and PSI MI, but sometimes also in an XML-based
format particular to a specific database. The purpose of the
evaluation is to compare the different formats with respect to
information representation and expressivity, but also to draw
conclusions on how well the formats are suited for
integration and information discovery.
The paper starts with a short overview of the databases that have
been studied, with focus on the purpose and information
contained in each database. We continue by giving an overview of
the studied XML-formats, presenting their structure and giving
examples fetched from the databases. The paper is concluded by
giving a comparison and recommendations for use of the formats.
2. STUDIED DATABASES
In this section we give a short overview of the four databases
KEGG [17] [19], BIND [2] [15], DIP [9] [26], and MINT [24]
[31] for which we have analysed the exchange formats. The
presentation aims to give a general view of the purpose of, and
the information available in, each database, since this affects the
formats and examples presented in the next section.
The KEGG pathway database [17] [19] is one of several
molecular databases available in the KEGG framework. It
provides information on molecular and gene interaction. Pathway
information is provided by a set of reference maps, describing
general information about known pathways. The maps can be
specified for different species. The maps are clickable and provide
links to protein and gene information from other databases, such
as PDB [3]. The information is also provided in table format.
KEGG can be searched in different ways. The pathways can be
reached from a browsable list of available pathways. It is also
possible to search for maps based on proteins or reactions, or to
run an advanced search on combinations of these and other topics.
KEGG data is available in SBML and KGML, a format defined
for KEGG.
The BIND database [2] [15] contains information on interactions
between molecules and molecule complexes. The purpose of the
database is to allow for prediction and exploration of pathways. In
contrast to KEGG, the information in BIND is structured around
so-called interaction pairs, i.e. two interacting molecules. For each
interaction pair information about the molecules and the
interaction itself is stored. BIND contains detailed information
about molecule structure and experimental information related to
interactions.
BIND can be browsed based on, for instance, interaction,
molecular complex, a particular pathway or publication. It also
allows free text queries and it is possible to create complex
boolean queries on the available fields in the database.
BIND data is available in XML in a special BIND format.
The DIP database [9] [26] contains information about protein
interactions that have been experimentally determined. The data
contained in DIP have been curated both manually and
computationally to ensure a reliable dataset. The information in
DIP is composed of nodes and edges. A node represents a protein
participating in an interaction. It contains basic information such
as name and subcellular location and references to other
databases. The edges correspond to interactions and contain
information such as experimental methods and which regions of
the proteins are involved in the interaction.
The database can be searched for proteins based on names or
identifiers. Expressions can be combined by boolean operators.
When one or more proteins of interest are found, it is possible to
follow their interactions by clicking on links.
DIP data is available in PSI MI and DIP's own format XIN.
Finally, the MINT database [24] [31] stores interactions between
biological molecules. In particular MINT contains experimentally
determined interactions where an enzyme is modifying one of the
molecules. The data in MINT is extracted from the scientific
literature and the scope is information on mammals. MINT
contains information about proteins, interactions and experiments.
Proteins are modeled as interactors and can be given a role in each
interaction.
MINT can be searched based on interactions or interactors. From
the result it is possible to follow links to other known interactions
or interactors. MINT data is available in the PSI MI format.
3. XML FOR REPRESENTING PATHWAYS
In this section we describe five different XML formats defined for
exporting data from existing databases for molecular pathways.
The first two, SBML [11] [16] and PSI MI [13] [25], have been
proposed as standards and are each defined in a joint effort of
several partners. The latter three formats are each proposed as the
export format for a particular database but, due to their structure,
are interesting to compare with the proposed standards.
3.1 SBML: Systems Biology Markup Language
Systems Biology Markup Language (SBML) [11] [16] was
created by the Systems Biology Workbench Development group
in cooperation with representatives from many system and tool
developers within bioinformatics. It is a language whose aim is to
serve as a future standard for information exchange in
computational biology, especially for molecular pathways.
The focus has been to create a format that allows models to be
encoded in XML. The standard’s main releases are called levels.
Currently level 2 is defined; its main features are described
below. Work on level 3 is ongoing. There are already a number
of systems supporting SBML, mainly simulation tools, drawing
tools and databases.
In the current standard, level 2, a pathway is described as a model
and each model can contain the following features:
- Compartment, which is a description of the container or
  environment in which the reaction takes place. A model
  normally consists of several compartments, for instance to
  capture subcellular information. For a compartment it is
  possible to define its size and how compartments are
  surrounded by each other.
- Species, which are the substances or entities that take part in
  the reactions. In SBML species can be everything from a
  simple ion, for instance a proton or an atom, through simple
  molecules, for instance glucose, to large molecules such as
  RNAs or proteins. For species it is possible to specify their
  spatial size and charge. It is also possible to specify model
  data, such as the initial concentration and amount and
  whether this can change during the reaction.
- Reactions, which are processes that change one or more of
  the species. The reaction can be a transformation, a transport
  or a binding reaction. For a reaction its reactants, products
  and modifiers are specified by giving references to the
  relevant species. It is also possible to specify whether the
  reaction is reversible, and its speed by defining a kinetic law
  that describes the reaction mathematically.
- Events, which are descriptions of discrete changes in the
  model. For example, an event can describe that the
  concentration of one of the species is halved when the
  amount of some other species reaches a specific threshold.
  For an event it is possible to specify what triggers the event,
  time constraints and the result of the event.
- In addition, a model can also contain definitions of
  parameters, mathematical functions, units and
  mathematical expressions. These are defined on the top
  level and can be used when defining the other entities. This
  allows for shorter and more readable descriptions in the rest
  of the model.
SBML models are object-oriented: all entities above are
subtypes of a most general type SBASE, and there are further
subtypes for Species. Notably, the type SBASE contains
the fields note and annotation, which allow for addition of user and
software specific information that is not contained in the rest of
the standard.
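The model structure described above can be illustrated with a short sketch. It reads a small, hand-written SBML-like fragment with Python's standard xml.etree.ElementTree module and lists each reaction with its reactants, products and modifiers. The fragment is an assumption made for illustration: it omits the XML namespace that real SBML files declare, and the identifiers (demo, R1, Enzyme_X and so on) are hypothetical rather than taken from any database.

```python
import xml.etree.ElementTree as ET

# Hand-written, namespace-free fragment using the element names discussed
# in the text. Real SBML files declare a namespace; all ids are hypothetical.
SBML_DOC = """
<sbml>
  <model id="demo">
    <listOfCompartments>
      <compartment id="default"/>
      <compartment id="uVol" outside="default"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="Acetate" compartment="uVol" initialAmount="0.0"/>
      <species id="Acetyl_CoA" compartment="uVol" initialAmount="0.0"/>
      <species id="Enzyme_X" compartment="uVol" initialAmount="0.0"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="R1" reversible="true">
        <listOfReactants><speciesReference species="Acetate"/></listOfReactants>
        <listOfProducts><speciesReference species="Acetyl_CoA"/></listOfProducts>
        <listOfModifiers><modifierSpeciesReference species="Enzyme_X"/></listOfModifiers>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def read_model(xml_text):
    """Return (species, reactions) as plain dictionaries."""
    model = ET.fromstring(xml_text).find("model")
    # Map each species id to the compartment it is located in.
    species = {s.get("id"): s.get("compartment")
               for s in model.iterfind("listOfSpecies/species")}
    reactions = {}
    for r in model.iterfind("listOfReactions/reaction"):
        # Each role is a list of references to species ids.
        reactions[r.get("id")] = {
            "reversible": r.get("reversible") == "true",
            "reactants": [x.get("species") for x in
                          r.iterfind("listOfReactants/speciesReference")],
            "products": [x.get("species") for x in
                         r.iterfind("listOfProducts/speciesReference")],
            "modifiers": [x.get("species") for x in
                          r.iterfind("listOfModifiers/modifierSpeciesReference")],
        }
    return species, reactions


species, reactions = read_model(SBML_DOC)
for rid, r in reactions.items():
    print(rid, r["reactants"], "->", r["products"], "modifiers:", r["modifiers"])
```

The sketch keeps species and reactions in separate dictionaries, mirroring the way SBML defines entities on the top level and connects them through references.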
There is currently ongoing work on SBML level 3. This version
focuses on solving cases where reactions or species are divided into
specific entities. In particular there is a proposal for representing
reactions and species that occur in more than one container
simultaneously. There are also proposals for representing
hierarchies of entities, for describing species through
composition or graphs of entities, and for generalizing
chemical reactions.
Currently, it is possible to retrieve data from KEGG and BioCyc
in SBML. Figure 1 shows how a KEGG pathway is represented in
SBML. The example has been shortened for the sake of brevity
and readability, but shows clearly how the different entities of
SBML are used for KEGG data.
<sbml …. >
  <model id="aae00010" name="aae00010">
    <listOfCompartments>
      <compartment id="default" name="default"/>
      <compartment id="uVol" name="uVol"
                   outside="default"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="aldehyde …"
               name="aldehyde dehydrogenase (NAD+)"
               compartment="uVol"
               initialAmount="0.0">
      </species>
      More species
    </listOfSpecies>
    <listOfReactions>
      <reaction id="R00235" name="R00235"
                reversible="true">
        <listOfReactants>
          <speciesReference species="Acetate">
          </speciesReference>
        </listOfReactants>
        <listOfProducts>
          <speciesReference
              species="Acetyl_minus_CoA">
          </speciesReference>
        </listOfProducts>
        <listOfModifiers>
          <modifierSpeciesReference
              species="acetate_minus_…"/>
        </listOfModifiers>
      </reaction>
      More reactions
    </listOfReactions>
  </model>
</sbml>
Figure 1. Example of a KEGG pathway in SBML.
3.2 PSI MI
The Proteomics Standards Initiative Molecular Interaction XML
format (PSI MI) [13] [25] was developed by the Proteomics
Standards Initiative, founded in 2002, as one initiative of the
Human Proteome Organisation (HUPO). The aim of the initiative
is to develop standards for data representation in proteomics to
facilitate data comparison, exchange and verification. As a first
step they have defined standards for protein-protein interaction
and mass spectrometry. Of interest for this work is the standard
for protein-protein interaction, the PSI MI format. The aim is to
extend the format to also include other types of molecules. The
format is intended to be used for exchange of data; thus it is not
designed for efficient storage.
<entrySet …>
  <entry>
    <interactorList>
      <proteinInteractor id="G_1">
        <names>
          <shortLabel/>
          <fullName>bcl-2-associated protein x, alpha splice form</fullName>
        </names>
        <xref>
          <primaryRef db="DIP" id="232N"/>
          <secondaryRef db="SWP" id="Q07812"/>
          <secondaryRef db="PIR" id="A47538"/>
          <secondaryRef db="GI" id="539664"/>
          <secondaryRef db="RefSeq" id="NP_620116"/>
        </xref>
        <organism ncbiTaxId="9606">
          <names>
            <shortLabel/>
            <fullName>Homo sapiens</fullName>
          </names>
        </organism>
      </proteinInteractor>
      More interactors follow here
    </interactorList>
    <interactionList>
      <interaction>
        <experimentList>
          <experimentDescription id="DIP_1X">
            <bibref>
              <xref>
                <primaryRef db="pubmed" id="91958"/>
              </xref>
            </bibref>
            <interactionDetection>
              <names>
                <shortLabel>Experimental</shortLabel>
              </names>
              <xref>
                <primaryRef db="DO" id="DO:0045"/>
                <secondaryRef db="PSI" id="MI:045"/>
              </xref>
            </interactionDetection>
          </experimentDescription>
        </experimentList>
        <participantList>
          <proteinParticipant>
            <proteinInteractorRef ref="G_1"/>
          </proteinParticipant>
          <proteinParticipant>
            <proteinInteractorRef ref="G_2"/>
          </proteinParticipant>
        </participantList>
        <xref>
          <primaryRef db="DIP" id="1E"/>
        </xref>
      </interaction>
      More interactions follow here.
    </interactionList>
  </entry>
</entrySet>
Figure 2. Example of DIP data in PSI MI format.
All data in PSI MI is structured around an entry. An entry
describes one or more interactions that can be considered as one
unit. The entry contains the following parts:
- Source and availabilitylist, used for describing the source of
  the data, for instance, an organization, and where the data
  can be accessed, for instance a database or similar.
- Experimentlist, which is a list of references to experiments
  verifying an interaction. The list normally contains links to
  publications.
- Interactorlist, which is a list of proteins participating in the
  interaction. For each interactor information about, for
  instance, substructure can be defined.
- Interactionlist, a list of the actual interactions. For each
  interaction it is possible to set one or more names, the type of
  interaction and also a database reference to more information
  about the interaction. The participating proteins are
  described by their names or references to the interactorlist. It
  is also possible to set a confidence level for detecting this
  protein in the experiment, the role of the protein and whether
  the protein was tagged or overexpressed in the experiment. In
  addition to this each interaction has a description of
  availability and experiments which normally are references to
  the lists above.
- Attributelist, a possibility for the user to add further
  information that does not fit into the entries above. This
  feature can be used for the different entities above.
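The relation between the interactorlist and the interactionlist can be sketched in a few lines of Python. The example below resolves the interactor references of each interaction to the full names stored in the interactorlist. It is a minimal sketch, not a PSI MI implementation: the fragment is hand-written and namespace-free, uses only element names that appear in the DIP export, and the protein names and the second cross-reference id are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; names and the id "EXAMPLE-2" are placeholders,
# not real database content.
PSI_DOC = """
<entrySet>
  <entry>
    <interactorList>
      <proteinInteractor id="G_1">
        <names><fullName>protein one (illustrative)</fullName></names>
        <xref><primaryRef db="DIP" id="232N"/></xref>
      </proteinInteractor>
      <proteinInteractor id="G_2">
        <names><fullName>protein two (illustrative)</fullName></names>
        <xref><primaryRef db="DIP" id="EXAMPLE-2"/></xref>
      </proteinInteractor>
    </interactorList>
    <interactionList>
      <interaction>
        <participantList>
          <proteinParticipant><proteinInteractorRef ref="G_1"/></proteinParticipant>
          <proteinParticipant><proteinInteractorRef ref="G_2"/></proteinParticipant>
        </participantList>
        <xref><primaryRef db="DIP" id="1E"/></xref>
      </interaction>
    </interactionList>
  </entry>
</entrySet>
"""


def resolve_interactions(xml_text):
    """Return a list of (interaction id, [participant full names])."""
    root = ET.fromstring(xml_text)
    result = []
    for entry in root.iterfind("entry"):
        # Index the interactors of this entry by their local id.
        names = {p.get("id"): p.findtext("names/fullName")
                 for p in entry.iterfind("interactorList/proteinInteractor")}
        for inter in entry.iterfind("interactionList/interaction"):
            ref = inter.find("xref/primaryRef")
            refs = [p.get("ref") for p in inter.iterfind(
                "participantList/proteinParticipant/proteinInteractorRef")]
            # Follow each reference back to the interactor list.
            result.append((ref.get("id"), [names[r] for r in refs]))
    return result


for interaction_id, participants in resolve_interactions(PSI_DOC):
    print(interaction_id, participants)
```

Because the interactors are indexed per entry, a reference is only resolved against the interactorlist of the entry it belongs to.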
Figure 2 shows part of the data from the DIP database and
illustrates how DIP uses PSI MI for exporting its data. The example
has been shortened for the sake of brevity and readability.
3.3 XIN
XIN is used as an exchange format for exporting data from the
DIP database. The XIN format is very flexible. It is possible for
the user to define additional constraints and attributes and, thus,
tailor the format for each application. In the XIN format a
document consists of three parts or sections.
- Attributes: this part defines the set of attributes that is used to
  describe each node in the graph. For each attribute that the
  user wants to use it is possible to define its name, its type and
  a default value. If any of the attributes defined is not given a
  value for a node or edge, its value is set to the specified
  default value.
- Nodes: these describe the interactors in the network. The only
  required information for a node is an identity, a name and a
  class specifying which type of interactor, for instance protein,
  this is.
- Edges: the edges correspond to the interactions. Each edge is
  defined by giving an identity, a class and the nodes that this
  interaction concerns.
In addition to the attributes, the nodes and edges can also be given
a set of features specifying additional properties. Figure 3
shows how the example fetched from DIP in PSI MI looks in
XIN format. Note specifically how the example starts by
defining extra attributes for the nodes and the use of features to
make references to other databases. The example has been
shortened for readability.
<project ….>
  <attributes>
    <nodeAtt name="descr" type="text">
    </nodeAtt>
    <nodeAtt name="organism" type="text">
    </nodeAtt>
    <nodeAtt name="taxon" type="text">
    </nodeAtt>
    <edgeAtt name="class" type="text">
    </edgeAtt>
    <edgeAtt name="name" type="text">
    </edgeAtt>
  </attributes>
  <node uid="DIP:232N" id="G:1"
        name="BAXA_HUMAN" class="protein">
    <feature name="SWP:Q07812" class="cref">
      <src>SwissProt</src>
    </feature>
    <att name="descr">
      <val>bcl-2-associated protein x, alpha splice form</val>
    </att>
    <att name="taxon">
      <val>9606</val>
    </att>
    <att name="organism">
      <val>Homo sapiens</val>
    </att>
  </node>
  More proteins follow
  <edge uid="DIP:1E" id="G:3" from="G:1"
        to="G:2" class="link">
    <feature uid="DIP:1X" class="exp:s">
      <src>PMID:9194558</src>
      <val>Experimental</val>
    </feature>
  </edge>
  More edges follow
</project>
Figure 3. Example of DIP data in XIN format.
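The node, edge and attribute sections of XIN map naturally onto a graph with default attribute values. The following sketch loads a small XIN-like fragment into Python dictionaries and fills in defaults for attributes that a node does not set. Note the assumptions: the DIP example does not show how a default value is written, so the default attribute on nodeAtt used here is hypothetical, as are the second node and its identifiers.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment. The "default" attribute on nodeAtt and the node
# "EXAMPLE:2" are assumptions made for illustration only.
XIN_DOC = """
<project>
  <attributes>
    <nodeAtt name="organism" type="text" default="unknown"/>
    <nodeAtt name="taxon" type="text" default="0"/>
  </attributes>
  <node uid="DIP:232N" id="G:1" name="BAXA_HUMAN" class="protein">
    <att name="taxon"><val>9606</val></att>
    <att name="organism"><val>Homo sapiens</val></att>
  </node>
  <node uid="EXAMPLE:2" id="G:2" name="EXAMPLE_PROTEIN" class="protein"/>
  <edge uid="DIP:1E" id="G:3" from="G:1" to="G:2" class="link"/>
</project>
"""


def load_graph(xml_text):
    """Return (nodes, edges); unset node attributes take their default."""
    root = ET.fromstring(xml_text)
    defaults = {a.get("name"): a.get("default")
                for a in root.iterfind("attributes/nodeAtt")}
    nodes = {}
    for n in root.iterfind("node"):
        atts = dict(defaults)              # start from the declared defaults
        for a in n.iterfind("att"):
            atts[a.get("name")] = a.findtext("val")   # override when given
        nodes[n.get("id")] = {"name": n.get("name"),
                              "class": n.get("class"), **atts}
    edges = [(e.get("from"), e.get("to"), e.get("class"))
             for e in root.iterfind("edge")]
    return nodes, edges


nodes, edges = load_graph(XIN_DOC)
print(nodes["G:2"]["organism"])   # falls back to the declared default
print(edges)
```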
3.4 BIND export format
The BIND database has its own XML exchange format for
exporting data from the database. This format is very verbose
mainly due to the fact that BIND stores a lot of detailed
information about the interacting molecules, i.e. information
about which parts of the molecules interact.
Like the BIND database, BIND XML is structured around
interaction pairs determined by experiments. Each entry describes
the interaction, interacting molecules and publications. Data is
often repeated and little use is made of cross-references.
Compared to the other formats BIND XML has many levels of
tags, which makes the format less readable for a human reader.
Figure 4 shows an excerpt of a BIND XML file. In the
example we have removed a lot of data and many of the levels for
reasons of space and readability. The main structure of the format is
centered around the interaction pairs (BIND-interaction) and the
interacting molecules (BIND-objects) located as subparts of the
interaction. This is exemplified in the XML code in figure 4. The
two entries BIND-Interaction_a and BIND-Interaction_b define
the two proteins interacting within the interaction.
<BIND-Submit>
  <BIND-Submit_interactions>
    …..
    <BIND-Interaction>
      <BIND-Interaction_iid>
        <Interaction-id>64786</Interaction-id>
      </BIND-Interaction_iid>
      <BIND-Interaction_a>
        <BIND-object>
          Identification and reference information
          <BIND-object_descr>Chain B, Dna …
          </BIND-object_descr>
        </BIND-object>
      </BIND-Interaction_a>
      <BIND-Interaction_b>
        Similar to the description of object a
      </BIND-Interaction_b>
      <BIND-Interaction_descr>
        <BIND-descr>
          <BIND-descr_simple-descr>Interaction
          between 103D_B and 103D_A.
          </BIND-descr_simple-descr>
          Under which conditions does this interaction occur and
          which parts of the molecules interact – omitted.
        </BIND-descr>
      </BIND-Interaction_descr>
      Reference to publications and authors describing this
      interaction – omitted
    </BIND-Interaction>
    More interactions
  </BIND-Submit_interactions>
</BIND-Submit>
Figure 4. Example of BIND-data.
<pathway name="path:map04070" org="map"
         title="Phosphatidylinositol system"
         image="http:…..gif"
         link="http://www.genome.ad.jp/….">
  <entry id="1" name="ec:3.1.3.67"
         type="enzyme"
         link="http…..1.3.67">
    <graphics name="3.1.3.67"
              type="rectangle" x="179"
              y="660" width="45" height="17"/>
  </entry>
  Many more entries
  <relation entry1="7" entry2="39"
            type="ECrel">
    <subtype name="compound" value="32"/>
  </relation>
  Many more relations
  No reactions in this pathway.
</pathway>
Figure 5. Example of KEGG data in KGML.
3.5 KEGG Markup Language (KGML)
The KEGG Markup Language (KGML) [20] is the XML-based
format used to export the KEGG graph objects, especially the
pathways. KGML can represent information about the graphical
objects and their relations in the drawings.
In KGML, exemplified in figure 5, the pathway is the top element
and each pathway corresponds to a pathway map in the KEGG
database. For each pathway, a name, organism and identification
number are specified. It is also possible to specify an additional
title, a link to an image and a link to where more information
about this pathway can be found. The pathway consists of a list of
entries, reactions and relations.
- Entry: this object contains information about a node of the
  pathway. It must have a name, type and identifier. It may also
  have a link to other resources, a link to a reaction and a link
  to a map entry, an explanation and an identification number.
  The entry often corresponds to a graphical element in the
  map and in that case the coordinates and format of the
  graphical element are defined for the entry. When the entry is a
  complex node, the subtype component is used for the entry.
  This subtype consists of a reference to the graphical
  component that this object is a part of.
- Relation: specifies a relationship between two proteins or
  two ortholog groups. (Corresponds to an arrow between two
  rectangles in the KEGG graph.) It is specified by giving the
  identifiers of the two objects and the type of this relation. It
  is also possible to specify more detailed information about
  the interaction as subtypes.
- Reaction: specifies a chemical reaction between a substrate
  and a product. It corresponds to an arrow between circles in
  the KEGG diagrams. For a reaction the attributes name and
  type are defined. The substrate and product elements are
  defined by giving identifiers of the entries.
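How entries and relations fit together can be shown with a small sketch that resolves the entry identifiers used by each relation to the entry names. It relies only on the element and attribute names visible in the KGML excerpt from KEGG (entry, relation, subtype, entry1, entry2). The fragment itself is hand-written, and the pathway name, the second entry name and the subtype value are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; "path:example", "ec:0.0.0.0" and the subtype
# value are placeholders, not KEGG content.
KGML_DOC = """
<pathway name="path:example" org="map" title="Illustrative pathway">
  <entry id="1" name="ec:3.1.3.67" type="enzyme"/>
  <entry id="2" name="ec:0.0.0.0" type="enzyme"/>
  <relation entry1="1" entry2="2" type="ECrel">
    <subtype name="compound" value="3"/>
  </relation>
</pathway>
"""


def relations_by_name(xml_text):
    """Return (name1, name2, relation type, [subtype names]) per relation."""
    pathway = ET.fromstring(xml_text)
    # Entries are identified by id; relations refer to these ids.
    names = {e.get("id"): e.get("name") for e in pathway.iterfind("entry")}
    out = []
    for rel in pathway.iterfind("relation"):
        out.append((names[rel.get("entry1")],
                    names[rel.get("entry2")],
                    rel.get("type"),
                    [s.get("name") for s in rel.iterfind("subtype")]))
    return out


for relation in relations_by_name(KGML_DOC):
    print(relation)
```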
4. COMPARISON OF THE FORMATS
Table 1 gives a summary and comparison of the main features of
the presented formats. The comparison is structured into six
sections: the environment and usage of the format; the
representation of interactors, i.e. proteins or similar; the
representation of the interaction itself; other information that can
be represented in the formats; formal expressiveness; and
possibilities of referencing.
Regarding the environment, SBML and PSI MI are both defined with the
aim of being general standards and are in use by existing systems.
Tools are available for working with both formats.
Considering the representation of interactors, all formats have at
least one entity for representing subjects. In some of the formats,
i.e. PSI MI, KGML, and XIN, it is possible to further specify the
class of the interactor. Currently PSI MI and BIND allow
specifying which parts of the molecules participate in the
interaction. This is, however, a topic that is under consideration
for the next version of SBML, where a proposal exists [11].
All the presented formats have one or more entities for
representing reactions with, as for interactors, a difference in level
of granularity. SBML has several subtypes for representing
reactions, while, for instance, PSI MI only has one. There is also
a difference in whether the interactors can be given roles and in the
number of interactors for each interaction: SBML allows
most variation in both number of interactors and roles, while
BIND only allows interaction pairs.
Table 1: Comparison of the presented XML-formats
| Feature | SBML | PSI MI | XIN | BIND XML | KGML |
|---|---|---|---|---|---|
| Environment for the specification: inventors | Systems Biology Workbench Development group. | Proteomics Standards Initiative. | DIP inventors. | BIND inventors. | KEGG inventors. |
| Environment for the specification: existing tools | Tools for validation, visualization and conversion. | Tools for viewing and analysis. | No current tools. | No current tools. | Tool for showing a graph in KGML. |
| Environment for the specification: used by | Used by simulation systems, drawing tools and database. | Datasets available from DIP and MINT. More databases accept data in PSI MI. | Used by the DIP database. | Used by the BIND database. | Used by the KEGG database. |
| Representation of interactors: used notation | Species, with the subtypes transformation, transport and binding. | Interactor. | Node. | Objects. | Entry. |
| Representation of interactors: description of parts of interactors | No current representation of parts of molecules but a proposal exists for next release of SBML. | Protein sequences and sites can be described. | No description of parts of molecules (possible to add in user defined attributes). | Detailed description of reacting substructure. | No representation of parts of molecules. |
| Representation of interaction: used notation | Reaction. | Interaction. | Edge, can be further specified by subclasses. | Interaction pair. | Relation and reaction. |
| Representation of interaction: role of interactor | Each reaction allows interactors of three predefined roles: reactants, products or modifiers. | Each interactor can be given a role. | No roles. | No roles for interactors. | No roles for the relation; substrate and product used for reaction. |
| Representation of interaction: number of interactors | Unbounded for each role. | Unbounded number of interactors. | Two interactors. | Two interactors. | Two interactors. |
| Other predefined entities: environment for reaction | Compartment defined as the environment for reactions. | It is possible to define compartments for reactions. | No environment for reaction. | No environment for reaction. | No environment for reaction. |
| Other predefined entities: experiment data | No data about experiments. | Data about experiments verifying the reaction. | No data about experiments. | Detailed data about experiments verifying the interaction. | No data about experiments. |
| Other predefined entities: math relations | Mathematical relations for reactions. | No math relations. | No math relations. | No math relations. | No math relations. |
| Expressiveness: main structure | All entities defined on top level. References between them indicate the structure of interactions. | Entities can be defined separately, but it is also possible to structure information around interactions. | All entities are defined on top level, references between them. | Information is structured around interaction. (Also possible to get the information structured around other types, i.e. objects.) | All entities are defined on top level, references between them. |
| Expressiveness: inheritance | A hierarchy between the predefined entities but no possibility for the user to define types. | No inheritance. | Possible to add classes to nodes and edges but no inheritance. | No inheritance. | No inheritance. |
| Expressiveness: definition of new attributes and entities | The note and annotation fields can be used for extra information. | A specific list of attributes can be used to add information that does not fit into the format. | It is possible to define attributes to nodes and edges. | No possibility to define new entity types. | No possibility to define new entity types. |
| Referencing to publications and databases | References to other sources only in the annotation field. | Links to publications and other databases. | Links to databases, but references to publications only allowed in new attributes. | Links to publications and other databases. | Links to other databases but not to publications. |
Besides nodes and edges there is also other information that is
of interest to store about pathways. PSI MI and BIND have
predefined tags for adding data about experiments verifying an
interaction. They also have tags for referencing publications about
experiments describing the interaction. SBML is
the only format that contains tags for the mathematical relations
needed for mathematical descriptions and simulations. It also
includes information about where the reaction takes place, i.e. the
compartment.
The main structure of all the formats is very similar, reflecting the
structure in a pathway graph. The information is structured around
some representation of the interacting subjects or nodes in the
pathway graph and the interactions themselves, the edges in the
graph. There are however also differences. In most of the formats
the different parts of the information are defined separately with
cross-references used for representing actual connections and
pathways. The only format that does not allow this is the BIND
format, where information is always structured around the
interaction. This makes the files larger, since the same
information is often represented in several places in the file. The
PSI MI format is the only format allowing both cross-references
and inclusion of information into the interactions, so that
users can choose the form that best suits their
application.
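The difference between inline representation and cross-references can be made concrete with a short sketch. It takes interaction records in which both molecules are embedded in every interaction, as in an interaction-centred layout, and splits them into a table of interactors plus interactions that only hold references. The record layout, the function name normalise and all identifiers are hypothetical and chosen for illustration; they are not taken from any of the formats.

```python
def normalise(interactions):
    """Split inline interaction records into an interactor table
    and a list of interactions that refer to interactors by id."""
    interactors = {}
    links = []
    for rec in interactions:
        ids = []
        for obj in (rec["a"], rec["b"]):
            # Store each molecule once, however often it is repeated.
            interactors.setdefault(obj["id"], obj)
            ids.append(obj["id"])
        links.append({"id": rec["id"], "participants": ids})
    return interactors, links


# Hypothetical inline records: molecule M1 is described twice.
inline = [
    {"id": "I1",
     "a": {"id": "M1", "descr": "molecule one"},
     "b": {"id": "M2", "descr": "molecule two"}},
    {"id": "I2",
     "a": {"id": "M1", "descr": "molecule one"},
     "b": {"id": "M3", "descr": "molecule three"}},
]

interactors, links = normalise(inline)
print(sorted(interactors))
print(links)
```

In the normalised form each molecule is described once, which is the saving that cross-references give over repeated inline descriptions.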
None of the formats allow for user defined hierarchies. However,
SBML, PSI MI, and XIN all contain possibilities for the user to
add information that does not fit into the predefined format. In
SBML and PSI MI this is done by adding the data to the
flexible note, annotation, or attribute fields, which users can
organise as they wish. XIN allows the user to define new
attributes for nodes and edges.
All formats except SBML have possibilities to add references to
other databases (in SBML such references can only be placed in the
annotation field), thus linking the modelled subjects
and reactions to other sources that give more information about
them. This is very important, not only for finding more information
but also because it gives a way of referencing and identifying the
same entities across the different databases.
5. IMPLICATIONS FOR INTEGRATION AND DISCOVERY
The comparison of the formats above shows that the main
structure, i.e. the description of interactors and interactions, is
similar in the formats described. This means that it would be
possible to translate and integrate data between the formats on a
general level.
The difference between the formats lies in the level of
granularity, especially when
describing types of interactors, interactions and interactions on
parts of the molecules. There is also a difference in what
additional information is represented. Most of the formats
have tags for describing links to publications while only SBML
provides tags for mathematical relations used for simulation.
Information provided in these parts of the formats will, naturally,
be harder to integrate into the other formats.
An interesting problem is that the terminology, for instance names
of interactors, is not standardized between the formats. This
causes a severe problem in identifying the same objects in the
different data sources. To some extent this problem can be
solved where the formats provide links to, and identifiers used
in, common sources such as Swissprot [7] or PDB [3].
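The idea of identifying the same object through a shared external identifier can be sketched as follows. The example matches interactor records from two sources on a common cross-reference, here the SwissProt accession Q07812 that both DIP exports carry for the same protein. The record layout, the function names and the extra record EXAMPLE_ONLY_IN_B with accession X00000 are hypothetical and serve only as illustration.

```python
def index_by_xref(records, db="SWP"):
    """Index records by their cross-reference id in the given database."""
    index = {}
    for rec in records:
        for ref_db, ref_id in rec["xrefs"]:
            if ref_db == db:
                index.setdefault(ref_id, []).append(rec)
    return index


def match(source_a, source_b, db="SWP"):
    """Pair up records from two sources that share a cross-reference."""
    a = index_by_xref(source_a, db)
    b = index_by_xref(source_b, db)
    return {acc: (a[acc], b[acc]) for acc in a.keys() & b.keys()}


# One source labels the protein by description, the other by short name.
from_psi_mi = [
    {"label": "bcl-2-associated protein x, alpha splice form",
     "xrefs": [("DIP", "232N"), ("SWP", "Q07812")]},
]
from_xin = [
    {"label": "BAXA_HUMAN", "xrefs": [("DIP", "232N"), ("SWP", "Q07812")]},
    {"label": "EXAMPLE_ONLY_IN_B", "xrefs": [("SWP", "X00000")]},  # placeholder
]

matches = match(from_psi_mi, from_xin)
for accession, (left, right) in matches.items():
    print(accession, left[0]["label"], "==", right[0]["label"])
```

The two labels differ completely, yet the records are paired because both refer to the same external accession; records without a shared reference remain unmatched.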
For search and discovery it is also interesting to consider the
possibility of querying over more than one format of data. In this
case the same possibilities and restrictions hold as for
integration: querying over the main structure would be easily
achieved, while the variation in granularity and additional
information would cause problems. Here too, the difference in
terminology would cause problems.
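The following sketch illustrates both points under the assumption
that data from each format has already been reduced to plain
interactions between named interactors. The data, the function
name and the synonym table are hypothetical.

```python
def partners(query, sources, synonyms=None):
    """Return (partner, source) pairs for every interaction that
    involves `query`, optionally mapping names through `synonyms`."""
    synonyms = synonyms or {}

    def canonical(name):
        # Fall back to the name itself when no synonym is known.
        return synonyms.get(name, name)

    target = canonical(query)
    found = set()
    for source_name, interactions in sources.items():
        for members in interactions:
            names = {canonical(member) for member in members}
            if target in names:
                found.update((p, source_name) for p in names - {target})
    return found


# Invented data: the same interactor is called "A" in one source
# and "alpha" in the other.
SOURCES = {
    "format1": [("A", "B")],
    "format2": [("alpha", "C")],
}

without_mapping = partners("A", SOURCES)
with_mapping = partners("A", SOURCES, synonyms={"alpha": "A"})
```

Querying the shared structure is a simple loop, but without the
synonym table the interaction stored under the name "alpha" is
missed, which is the terminology problem in miniature.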
All this gives some interesting implications for future work. One
line of work would be to investigate how existing tools for
working with XML such as query languages and database
generators can be used for achieving integrated databases and
discovery tools for data provided in the above formats. Here it
would also be interesting to investigate limitations of the formats
and the scalability of XML. Another interesting line of work
would be to investigate whether work on ontologies and ontology
integration [22] can be used to overcome the terminology problem
when integrating data of various formats.
6. CONCLUSION AND DISCUSSION
In this paper we have investigated and compared XML based
formats for representation of pathway data. The formats compared
are the two proposed standards SBML and PSI MI, and three
formats proposed as exchange formats for particular databases.
The comparison shows that the main structure of the formats is
similar, but that the level of detail of descriptions and also
information content is different.
The comparison shows that the two formats proposed as standards
are the ones allowing for the most flexible and general
representation of molecular interactions. Of these two, SBML is
currently more tailored towards applications where simulation and
modelling are needed, while PSI MI is preferable if representation
of performed experiments is of importance. Of the three database
formats, XIN is the most general and flexible, since it allows the
user to define their own attributes, while the other two are more
tailored to their specific database applications.
The most interesting extension needed for the standards is the
possibility of representing molecular structure. Here there are
ongoing proposals both for SBML and within the BioPAX
consortium. Another interesting extension would be the
standardisation of terminology, as provided by for instance Gene
Ontology. A further very interesting feature would be user-defined
entities, since these would give more flexibility to the
definitions.
7. ACKNOWLEDGMENTS
The author is grateful to Patrick Lambrix and Vaida Jakoniené for
valuable comments on this work. The author would also like to
thank the databases for allowing access to their data.
8. REFERENCES
[1] Achard, F, Vaysseix, G, and Barillot E. XML bioinformatics
and data integration. Bioinformatics 17(2):115-125 (2001).
[2] Bader, G. D, Donaldson, I, Wolting, C. et al. BIND – The
Biomolecular Network Database. Nucleic Acids Research,
29(1). (2001)
[3] Berman H. M, Westbrook J, Feng Z. et al. The Protein Data
Bank. Nucleic Acids Research, 28 235-242 (2000)
[4] BioCyc data collection. http://biocyc.org. (Accessed May
2004.)
[5] The BioPAX Consortium. BioPAX Ontology Class structure.
September 2003. www.biopax.org. (Accessed April 2004.)
[6] The BioPAX Consortium. The BioPAX data exchange
format v 0.5 (draft release). September 2003.
www.biopax.org. (Accessed April 2004.)
[7] Boeckmann B., Bairoch A., Apweiler R. et al. The
SWISS-PROT protein knowledgebase and its supplement TrEMBL
in 2003. Nucleic Acids Res. 31:365-370 (2003).
[8] Brazma, A. Editorial: On the Importance of Standardisation
in Life Sciences. Bioinformatics 17(2):113-114 (2001)
[9] Database of Interacting Proteins. http://dip.doe-mbi.ucla.edu.
(Accessed May 2004)
[10] Davidson S. B, Tannen V, Crabtree J. et al. K2/Kleisli and
GUS: Experiments in integrated access to genomic data
sources. IBM Systems Journal 40(2):512-531 (2001).
[11] Finney, A. Systems Biology Markup Language (SBML)
Level 3: Proposal: Multi-component Species Features.
Proposal manuscript. March 2004 Available at
http://www.cds.caltech.edu/~afinney/multi-componentspecies.pdf (Accessed April 2004)
[12] Finney, A. and Hucka M. Systems Biology Markup
Language (SBML) Level 2: Structures and Facilities for
Model Definitions. June 28 2003. Available at
http://sbml.org/documents (Accessed April 2004)
[13] Hermjakob H, Montecchi-Palazzi L, Bader G. et al. The
HUPO PSI’s Molecular Interaction format – a community
standard for the representation of protein interaction data.
Nature Biotechnology 22(2):177-183 (2004)
[14] Gene Ontology Consortium. www.geneontology.org
[15] Hogue, C. Biomolecular Interaction Networks Database.
www.blueprint.org/bind/bind.php (Accessed April 2004)
[16] Hucka, M, Finney A, Sauro H. M. et al. The systems biology
markup language (SBML): a medium for representation and
exchange of biochemical network models. Bioinformatics
19(4):524-531 (2003)
[17] Kanehisa, M. and Goto, S. KEGG: Kyoto Encyclopedia of
Genes and Genomes. Nucleic Acids Res. 28, 27-30 (2000).
[18] Krishnamurthy, L, Nadeau J, Ozsoyoglu G et al. Pathways
Database System: An integrated set of tools for biological
pathways. Proc. The Eighteenth Annual ACM Symposium
on Applied Computing, (2003), Melbourne, Florida, USA.
[19] Kyoto University Bioinformatics Centre. KEGG pathway
database. http://www.genome.ad.jp/kegg/pathway.html
(Accessed May 2004)
[20] Kyoto University Bioinformatics Centre. KEGG Markup
Language Manual. Specification available through
http://www.genome.ad.jp/kegg/docs/xml/ (Accessed May
2004)
[21] Lambrix, P, Jakoniené, V. Towards transparent access to
multiple biological databanks. Chen P. (Eds), Proceedings
of the First Asia-Pacific Bioinformatics Conference, 53-60,
Adelaide, Australia (2003) Publ. Australian Computer
Society, ISBN 0-909925-97-6.
[22] Lambrix, P, Tan, H, Merging DAML+OIL Ontologies,
Barzdins (Eds) Proceedings of the Sixth International Baltic
Conference on Databases and Information Systems, Riga,
Latvia, 2004. Publ. Latvijas Universitāte,
ISBN 9984-770-11-7.
[23] McEntire R, Karp P, Abernethy N. et al. An evaluation of
Ontology Exchange Languages for Bioinformatics. Proc
International Conference of Intelligent Systems for
Molecular Biology 8:239-250 (2000) Publ. Amer Assn for
Artificial, ISBN 1577351150
[24] The Molecular INTeraction database.
http://mint.bio.uniroma2.it/mint/ (Accessed May 2004)
[25] Proteomics Standards Initiative Molecular Interaction XML
Format Documentation. Version 1.0. 2002
http://psidev.sourceforge.net/mi/xml/user/ (Accessed April
2004)
[26] Salwinski L, Miller CS, Smith AJ. et al. The Database of
Interacting Proteins: 2004 Update. Nucleic Acids Research
32 Database Issue D449-451. (2004)
[27] The Signalling Pathway Database.
http://www.grt.kyushu-u.ac.jp/spad/ (Accessed April 2004)
[28] Sirava, M, Schäfer T, Eigelsperger M. et al. BioMiner –
modelling, analyzing, and visualizing biochemical pathways
and networks. Bioinformatics 18(Suppl. 2) S219-S230
(2002).
[29] Takai-Igarashi, T and Kaminuma T. Cell Signaling
Networks Database. http://geo.nihs.go.jp/csndb (Accessed
April 2004)
[30] W3C. The Extensible Markup Language.
http://www.w3.org/XML
[31] Zanzoni, A, Montecchi-Palazzi L, Quondam M. et al. MINT: a
Molecular INTeraction database. FEBS Letters, 513, 135-140.
http://www.elsevier.nl/febs/229/29/38/index.htt