XML representations of pathway data: a comparison

Lena Strömbäck
Department of Computer and Information Science
Linköpings Universitet
S-581 83 Linköping, Sweden
+46 13 28 2324
lestr@ida.liu.se

ABSTRACT
Standardisation and integration of pathway data is a topic of current interest within bioinformatics, addressed by several consortia, e.g. SBML, PSI and BioPAX. These groups use or consider XML for representation of their standards. Furthermore, XML is used by many of the existing databases containing pathway information for export and exchange of data. In this paper we compare some of the XML representations used by the standardisation committees and existing databases. The contribution of the paper is the comparison and evaluation of the representations, together with a discussion of the implications for information discovery and integration.

Categories and Subject Descriptors
J.3 [Life and Medical Sciences]: Biology and Genetics

General Terms
Standardization, Languages

Keywords
Bioinformatics, data integration, XML, pathways, databases, SBML, PSI MI

1. INTRODUCTION
Currently, research within biology rapidly generates new knowledge on how genes, proteins and other substances interact. Each piece of knowledge generated by an experiment gives one small piece in the puzzle of how a cell works. However, due to the complexity of this data there is a need for search and integration of large datasets, in which the connections between these small pieces can be discovered, thus allowing conclusions on how larger parts of the puzzle are constituted.

Today a large number of databases containing information about protein and gene interactions are available, for instance KEGG [17] [19], BIND [2] [15], MINT [24] [31], BioCyc [4], CSNDB [29], DIP [9] [26], and SPAD [27]. All these databases have slightly different purposes, and thus the dataset within each of them, the properties stored for each item and the search possibilities differ between the databases.
It would be of great value for the user to combine and compare information from several of these data sources, as well as to have tools for advanced searches in these large datasets. The importance of this is shown by various research directions within bioinformatics, e.g. integration of databases [10] [21] and creation of general purpose tools for search and discovery [18] [28]. One key issue here is the possibility to supply data in comparable formats, which is important both for integration of datasets and for search and discovery of data [8]. There have been a number of proposals for such exchange formats; [1] and [23] give an overview and evaluation. These evaluations show that XML [30] has many advantages as a format for representing pathway information. This view is also supported by ongoing standardisation efforts, e.g. SBML [11] [16] and PSI MI [13] [25], both of which are formats defined as specialisations of XML. Standardisation work aiming at creating common terminology and ontologies for representing pathway data, e.g. BioPAX [5] [6] and GeneOntology [14], also provides its specifications and tools in XML. Many of the databases mentioned above provide their data for export in some variant of XML.

In this work we have compared the data provided in XML for exchange from four different databases containing pathway information. The data is provided in the two proposed standards SBML and PSI MI, but sometimes also in an XML-based format particular to a specific database. The purpose of the evaluation is to compare the different formats with respect to information representation and expressivity, but also to draw conclusions on how well the formats are suited for integration and information discovery.

The paper starts with a short overview of the databases that have been studied, with focus on the purpose of and information contained in each database.
We continue by giving an overview of the studied XML formats, presenting their structure and giving examples fetched from the databases. The paper concludes with a comparison and recommendations for use of the formats.

2. STUDIED DATABASES
In this section we give a short overview of the four databases KEGG [17] [19], BIND [2] [15], DIP [9] [26], and MINT [24] [31], for which we have analysed the exchange formats. The main aim of the presentation is to give a general view of the purpose of and information available in the databases, since this affects the formats and examples presented in the next section.

The KEGG pathway database [17] [19] is one of several molecular databases available in the KEGG framework. It provides information on molecular and gene interaction. Pathway information is provided by a set of reference maps, describing general information about known pathways. The maps can be specified for different species. The maps are clickable and provide links to protein and gene information from other databases, such as PDB [3]. The information is also provided in table format. KEGG can be searched in different ways. The pathways can be reached from a browsable list of available pathways. It is also possible to search for maps based on proteins or reactions, or to run advanced searches on combinations of these and other topics. KEGG data is available in SBML and KGML, a format defined for KEGG.

The BIND database [2] [15] contains information on interactions between molecules and molecule complexes. The purpose of the database is to allow for prediction and exploration of pathways. In contrast to KEGG, the information in BIND is structured around so-called interaction pairs, i.e. two interacting molecules. For each interaction pair, information about the molecules and the interaction itself is stored. BIND contains detailed information about molecule structure and experimental information related to interactions.
BIND can be browsed based on, for instance, interaction, molecular complex, a particular pathway or publication. It also allows free text queries, and it is possible to create complex boolean queries on the available fields in the database. BIND data is available in XML in a special BIND format.

The DIP database [9] [26] contains information about protein interactions that have been experimentally determined. The data contained in DIP have been curated both manually and computationally to ensure a reliable dataset. The information in DIP is composed of nodes and edges. A node represents a protein participating in an interaction. It contains basic information such as name and subcellular location, and references to other databases. The edges correspond to interactions and contain information such as experimental methods and which regions of the proteins are involved in the interaction. The database can be searched for proteins based on names or identifiers. Expressions can be combined by boolean operators. When one or more proteins of interest are found, it is possible to follow their interactions by clicking on links. DIP data is available in PSI MI and in DIP's own format, XIN.

Finally, the MINT database [24] [31] stores interactions between biological molecules. In particular, MINT contains experimentally determined interactions where an enzyme is modifying one of the molecules. The data in MINT is extracted from the scientific literature, and the scope is information on mammals. MINT contains information about proteins, interactions and experiments. Proteins are modeled as interactors and can be given a role in each interaction. MINT can be searched based on interactions or interactors. From the result it is possible to follow links to other known interactions or interactors. MINT data is available in the PSI MI format.

3. XML FOR REPRESENTING PATHWAYS
In this section we describe five different XML formats defined for exporting data from existing databases for molecular pathways. The first two, SBML [11] [16] and PSI MI [13] [25], have been proposed as standards and are defined in a joint effort of several partners. The latter three formats are each proposed as the export format for a particular database, but their structure makes them very interesting to compare with the proposed standards.

3.1 SBML: Systems Biology Markup Language
Systems Biology Markup Language (SBML) [11] [16] was created by the Systems Biology Workbench Development group in cooperation with representatives from many system and tool developers within bioinformatics. It is a language whose aim is to serve as a future standard for information exchange in computational biology, especially for molecular pathways. The focus has been to create a format that allows models to be encoded in XML. The standard's main releases are called levels. Currently level 2 is defined; its main features are described below. Work on level 3 is ongoing. There are already a number of systems supporting SBML, mainly simulation tools, drawing tools and databases.

In the current standard, level 2, a pathway is described as a model, and each model can contain the following features:

Compartment, which is a description of the container or environment in which the reaction takes place. A model normally consists of several compartments, for instance to capture subcellular information. For a compartment it is possible to define its size and how compartments are surrounded by each other.

Species, which are the substances or entities that take part in the reactions. In SBML, species can be everything from a simple ion, for instance a proton or an atom, through simple molecules, for instance glucose, to large molecules such as RNAs or proteins. For species it is possible to specify their spatial size and charge.
It is also possible to specify model data, such as the initial concentration and amount, and whether these can change during the reaction.

Reactions, which are processes that change one or more of the species. The reaction can be a transformation, a transport or a binding reaction. For a reaction, its reactants, products and modifiers are specified by giving references to the relevant species. It is also possible to specify whether the reaction is reversible, and its speed by defining a kinetic law that describes the reaction mathematically.

Events, which are descriptions of discrete changes in the model. For example, an event can describe that the concentration of one of the species is halved when the amount of some other species reaches a specific threshold. For an event it is possible to specify what triggers the event, time constraints and the result of the event.

In addition, a model can also contain definitions of parameters, mathematical functions, units and mathematical expressions. These are defined on the top level and can be used when defining the other entities. This allows for shorter and more readable descriptions in the rest of the model.

SBML models are object-oriented: all entities above are subtypes of a most general type SBASE, and there are subtypes for Species. Notably, the type SBASE contains the fields note and annotation, which allow for addition of user and software specific information that is not contained in the rest of the standard.

There is currently ongoing work on SBML level 3. This version focuses on solving cases where reactions or species are divided in specific entities. In particular there is a proposal for representing reactions and species that occur in more than one container simultaneously. There are also proposals for representing hierarchies of entities, for describing species through composition or graphs of entities, and for generalizing chemical reactions.

Currently, it is possible to retrieve data from KEGG and BioCyc in SBML. Figure 1 shows how a KEGG pathway is represented in SBML. The example has been shortened for the sake of brevity and readability, but shows clearly how the different entities of SBML are used for KEGG data.

    <sbml …. >
      <model id="aae00010" name="aae00010">
        <listOfCompartments>
          <compartment id="default" name="default"/>
          <compartment id="uVol" name="uVol" outside="default"/>
        </listOfCompartments>
        <listOfSpecies>
          <species id="aldehyde …" name="aldehyde dehydrogenase (NAD+)" compartment="uVol" initialAmount="0.0">
          </species>
          More species
        </listOfSpecies>
        <listOfReactions>
          <reaction id="R00235" name="R00235" reversible="true">
            <listOfReactants>
              <speciesReference species="Acetate">
              </speciesReference>
            </listOfReactants>
            <listOfProducts>
              <speciesReference species="Acetyl_minus_CoA">
              </speciesReference>
            </listOfProducts>
            <listOfModifiers>
              <modifierSpeciesReference species="acetate_minus_…"/>
            </listOfModifiers>
          </reaction>
          More reactions
        </listOfReactions>
      </model>
    </sbml>

Figure 1. Example of a KEGG pathway in SBML.

3.2 PSI MI
The Proteomics Standards Initiative Molecular Interaction XML format (PSI MI) [13] [25] was developed by the Proteomics Standards Initiative, founded in 2002, as one initiative of the Human Proteome Organisation (HUPO). The aim of the initiative is to develop standards for data representation in proteomics to facilitate data comparison, exchange and verification. As a first step they have defined standards for protein-protein interaction and mass spectrometry. Of interest for this work is the standard for protein-protein interaction, the PSI MI format. The aim is to extend the format to also include other types of molecules. The format is intended to be used for exchange of data; thus it is not designed for efficient storage.

All data in PSI MI is structured around an entry. An entry describes one or more interactions that can be considered as one unit. The entry contains the following parts:

Source and availabilitylist, used for describing the source of the data, for instance an organization, and where the data can be accessed, for instance a database or similar.

Experimentlist, which is a list of references to experiments verifying an interaction. The list normally contains links to publications.

Interactorlist, which is a list of proteins participating in the interaction. For each interactor, information about, for instance, substructure can be defined.

Interactionlist, a list of the actual interactions. For each interaction it is possible to set one or more names, the type of interaction and also a database reference to more information about the interaction. The participating proteins are described by their names or by references to the interactorlist. It is also possible to set a confidence level for detecting this protein in the experiment, the role of the protein and whether the protein was tagged or overexpressed in the experiment. In addition, each interaction has a description of availability and experiments, which normally are references to the lists above.

Attributelist, a possibility for the user to add further information that does not fit into the entries above. This feature can be used for the different entities above.

Figure 2 shows part of the data from the DIP database and how DIP uses PSI MI for exporting its data. The example has been shortened for the sake of brevity and readability.

    <entrySet …>
      <entry>
        <interactorList>
          <proteinInteractor id="G_1">
            <names>
              <shortLabel/>
              <fullName>bcl-2-associated protein x, alpha splice form</fullName>
            </names>
            <xref>
              <primaryRef db="DIP" id="232N"/>
              <secondaryRef db="SWP" id="Q07812"/>
              <secondaryRef db="PIR" id="A47538"/>
              <secondaryRef db="GI" id="539664"/>
              <secondaryRef db="RefSeq" id="NP_620116"/>
            </xref>
            <organism ncbiTaxId="9606">
              <names>
                <shortLabel/>
                <fullName>Homo sapiens</fullName>
              </names>
            </organism>
          </proteinInteractor>
          More interactors follow here
        </interactorList>
        <interactionList>
          <interaction>
            <experimentList>
              <experimentDescription id="DIP_1X">
                <bibref>
                  <xref>
                    <primaryRef db="pubmed" id="91958"/>
                  </xref>
                </bibref>
                <interactionDetection>
                  <names>
                    <shortLabel>Experimental</shortLabel>
                  </names>
                  <xref>
                    <primaryRef db="DO" id="DO:0045"/>
                    <secondaryRef db="PSI" id="MI:045"/>
                  </xref>
                </interactionDetection>
              </experimentDescription>
            </experimentList>
            <participantList>
              <proteinParticipant>
                <proteinInteractorRef ref="G_1"/>
              </proteinParticipant>
              <proteinParticipant>
                <proteinInteractorRef ref="G_2"/>
              </proteinParticipant>
            </participantList>
            <xref>
              <primaryRef db="DIP" id="1E"/>
            </xref>
          </interaction>
          More interactions follow here.
        </interactionList>
      </entry>
    </entrySet>

Figure 2. Example of DIP data in PSI MI format.
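Both proposed standards resolve interactions through references: an SBML reaction points to species by identifier, and a PSI MI interaction points to entries in the interactor list. The following is a minimal sketch, not a reference implementation, of how such references could be followed with Python's standard xml.etree.ElementTree module. The two fragments are hypothetical and only modelled on Figures 1 and 2: they omit the XML namespaces that real exports typically declare, and the name given to interactor G_2 is invented for illustration. The function names sbml_reactions and psi_mi_interactions are likewise assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment modelled on Figure 1.
SBML = """
<sbml>
  <model id="aae00010" name="aae00010">
    <listOfReactions>
      <reaction id="R00235" name="R00235" reversible="true">
        <listOfReactants>
          <speciesReference species="Acetate"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="Acetyl_minus_CoA"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

# Hypothetical fragment modelled on Figure 2; the G_2 name is invented.
PSI_MI = """
<entrySet>
  <entry>
    <interactorList>
      <proteinInteractor id="G_1">
        <names><fullName>bcl-2-associated protein x, alpha splice form</fullName></names>
      </proteinInteractor>
      <proteinInteractor id="G_2">
        <names><fullName>hypothetical partner protein</fullName></names>
      </proteinInteractor>
    </interactorList>
    <interactionList>
      <interaction>
        <participantList>
          <proteinParticipant><proteinInteractorRef ref="G_1"/></proteinParticipant>
          <proteinParticipant><proteinInteractorRef ref="G_2"/></proteinParticipant>
        </participantList>
        <xref><primaryRef db="DIP" id="1E"/></xref>
      </interaction>
    </interactionList>
  </entry>
</entrySet>
"""

# Role name -> path of the species references below a <reaction> element.
SBML_ROLES = {
    "reactants": "listOfReactants/speciesReference",
    "products": "listOfProducts/speciesReference",
    "modifiers": "listOfModifiers/modifierSpeciesReference",
}


def sbml_reactions(text):
    """Return one dict per reaction, listing referenced species ids by role."""
    root = ET.fromstring(text)
    reactions = []
    for reaction in root.iter("reaction"):
        record = {"id": reaction.get("id")}
        for role, path in SBML_ROLES.items():
            record[role] = [ref.get("species") for ref in reaction.findall(path)]
        reactions.append(record)
    return reactions


def psi_mi_interactions(text):
    """Return one dict per interaction, with participant refs resolved to names."""
    root = ET.fromstring(text)
    # Index the interactor list so that references can be resolved.
    names = {p.get("id"): p.findtext("names/fullName")
             for p in root.iter("proteinInteractor")}
    interactions = []
    ref_path = "participantList/proteinParticipant/proteinInteractorRef"
    for interaction in root.iter("interaction"):
        refs = [r.get("ref") for r in interaction.findall(ref_path)]
        interactions.append({
            "id": interaction.find("xref/primaryRef").get("id"),
            "participants": [(ref, names.get(ref)) for ref in refs],
        })
    return interactions


if __name__ == "__main__":
    print(sbml_reactions(SBML))
    print(psi_mi_interactions(PSI_MI))
```

In this sketch the SBML reader keeps the three predefined roles apart, while the PSI MI reader returns an unordered list of participants, which mirrors the difference in how the two formats treat roles.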
3.3 XIN
XIN is used as an exchange format for exporting data from the DIP database. The XIN format is very flexible. It is possible for the user to define additional constraints and attributes and thus tailor the format for each application. In the XIN format a document consists of three parts or sections.

Attributes, this part defines the set of attributes that is used to describe the nodes and edges in the graph. For each attribute that the user wants to use it is possible to define its name, its type and a default value. If any of the attributes defined is not given a value for a node or edge, its value is set to the specified default value.

Nodes, describe the interactors in the network. The only required information for a node is an identity, a name and a class specifying which type of interactor, for instance protein, this is.

Edges, the edges correspond to the interactions. Each edge is defined by giving an identity, a class and the nodes that this interaction concerns.

In addition to the attributes, the nodes and edges can also be given a set of features specifying additional properties.

Figure 3 shows how the DIP example given above in PSI MI looks in XIN format. Note specifically how the example starts by defining extra attributes for the nodes and edges, and the use of features to make references to other databases. The example has been shortened for readability.

    <project ….>
      <attributes>
        <nodeAtt name="descr" type="text"> </nodeAtt>
        <nodeAtt name="organism" type="text"> </nodeAtt>
        <nodeAtt name="taxon" type="text"> </nodeAtt>
        <edgeAtt name="class" type="text"> </edgeAtt>
        <edgeAtt name="name" type="text"> </edgeAtt>
      </attributes>
      <node uid="DIP:232N" id="G:1" name="BAXA_HUMAN" class="protein">
        <feature name="SWP:Q07812" class="cref">
          <src>SwissProt</src>
        </feature>
        <att name="descr">
          <val>bcl-2-associated protein x, alpha splice form</val>
        </att>
        <att name="taxon">
          <val>9606</val>
        </att>
        <att name="organism">
          <val>Homo sapiens</val>
        </att>
      </node>
      More proteins follow
      <edge uid="DIP:1E" id="G:3" from="G:1" to="G:2" class="link">
        <feature uid="DIP:1X" class="exp:s">
          <src>PMID:9194558</src>
          <val>Experimental</val>
        </feature>
      </edge>
      More edges follow
    </project>

Figure 3. Example of DIP data in XIN format.

3.4 BIND export format
The BIND database has its own XML exchange format for exporting data from the database. This format is very verbose, mainly because BIND stores a lot of detailed information about the interacting molecules, e.g. information about which parts of the molecules interact.

Like the BIND database itself, BIND XML is structured around interaction pairs determined by experiments. Each entry describes the interaction, the interacting molecules and publications. Data is often repeated and little use is made of cross-references. Compared to the other formats, BIND XML has many levels of tags, which makes the format less readable for a human reader.

Figure 4 shows an excerpt of a BIND XML file. In the example we have removed a lot of data and many of the levels for reasons of space and readability. The main structure of the format is centered around the interaction pairs (BIND-interaction) and the interacting molecules (BIND-objects), located as subparts of the interaction. The two entries BIND-Interaction_a and BIND-Interaction_b define the two molecules interacting within the interaction.

    <BIND-Submit>
      <BIND-Submit_interactions>
        …..
        <BIND-Interaction>
          <BIND-Interaction_iid>
            <Interaction-id>64786</Interaction-id>
          </BIND-Interaction_iid>
          <BIND-Interaction_a>
            <BIND-object>
              Identification and reference information
              <BIND-object_descr>Chain B, Dna … </BIND-object_descr>
            </BIND-object>
          </BIND-Interaction_a>
          <BIND-Interaction_b>
            Similar to the description of object a
          </BIND-Interaction_b>
          <BIND-Interaction_descr>
            <BIND-descr>
              <BIND-descr_simple-descr>Interaction between 103D_B and 103D_A. </BIND-descr_simple-descr>
              Under which conditions does this interaction occur and which parts of the molecules interact – omitted.
            </BIND-descr>
          </BIND-Interaction_descr>
          Reference to publications and authors describing this interaction – omitted
        </BIND-Interaction>
        More interactions
      </BIND-Submit_interactions>
    </BIND-Submit>

Figure 4. Example of BIND-data.

3.5 KEGG Markup Language (KGML)
The KEGG Markup Language (KGML) [20] is the XML-based format used to export the KEGG graph objects, especially the pathways. In KGML, information about the graphical objects and their relations in the drawings can be represented.

In KGML, exemplified in figure 5, the pathway is the top element, and each pathway corresponds to a pathway map in the KEGG database. For each pathway a name, an organism and an identification number are specified. It is also possible to specify an additional title, a link to an image and a link to where more information about this pathway can be found. The pathway consists of a list of entries, reactions and relations.

    <pathway name="path:map04070" org="map" title="Phosphatidylinositol system" image="http:…..gif" link="http://www.genome.ad.jp/….">
      <entry id="1" name="ec:3.1.3.67" type="enzyme" link="http…..1.3.67">
        <graphics name="3.1.3.67" type="rectangle" x="179" y="660" width="45" height="17"/>
      </entry>
      Many more entries
      <relation entry1="7" entry2="39" type="ECrel">
        <subtype name="compound" value="32"/>
      </relation>
      Many more relations
      No reactions in this pathway.
    </pathway>

Figure 5. Example of KEGG data in KGML.

Entry, this object contains information about a node of the pathway. It must have a name, type and identifier. It may also have a link to other resources, a link to a reaction and a link to a map entry, an explanation and an identification number.
The entry often corresponds to a graphical element in the map, and in that case the coordinates and formats for this element are defined for the entry. When the entry is a complex node, the subtype component is used for the entry. This subtype consists of a reference to the graphical component that this object is a part of.

Relation, specifies a relationship between two proteins or two ortholog groups. It corresponds to an arrow between two rectangles in the KEGG graph. It is specified by giving the identifiers of the two objects and the type of this relation. It is also possible to specify more detailed information about the interaction as subtypes.

Reaction, specifies a chemical reaction between a substrate and a product. It corresponds to an arrow between circles in the KEGG diagrams. For a reaction the attributes name and type are defined. The substrate and product elements are defined by giving identifiers of the entries.

4. COMPARISON OF THE FORMATS
Table 1 gives a summary and comparison of the main features of the presented formats. The comparison is structured into six different sections: the environment and usage of the format; the representation of interactors, i.e. proteins or similar; the representation of the interaction itself; other information that can be represented in the formats; formal expressiveness; and possibilities of referencing.

Regarding the environment, SBML and PSI MI are both defined with the aim of being general standards and are in use by existing systems. There are existing tools available for working with these formats.

Considering the representation of interactors, all formats have at least one entity for representing subjects. In some of the formats, i.e. PSI MI, KGML, and XIN, it is possible to further specify the class of the interactor. Currently PSI MI and BIND allow for specifying which parts of the molecules participate in the interaction.
This is, however, a topic that is under consideration for the next version of SBML, where a proposal exists [11].

All the presented formats have one or more entities for representing reactions with, as for interactors, a difference in level of granularity. SBML has several subtypes for representing reactions while, for instance, PSI MI only has one. There is also a difference in whether the interactors can be given roles and in the number of interactors for each interaction: SBML allows the most variation in both number of interactors and roles, while BIND only allows interaction pairs.

Table 1: Comparison of the presented XML-formats

| | SBML | PSI MI | XIN | BIND XML | KGML |
|---|---|---|---|---|---|
| Environment for the specification: | | | | | |
| - Inventors | Systems Biology Workbench Development group. | Proteomics Standards Initiative. | DIP inventors. | BIND inventors. | KEGG inventors. |
| - Existing tools | Tools for validation, visualization and conversion. | Tools for viewing and analysis. | No current tools. | No current tools. | Tool for showing a graph in KGML. |
| - Used by | Used by simulation systems, drawing tools and databases. | Datasets available from DIP and MINT. More databases accept data in PSI MI. | Used by the DIP database. | Used by the BIND database. | Used by the KEGG database. |
| Representation of interactors: | | | | | |
| - Used notation | Species, with the subtypes transformation, transport and binding. | Interactor. | Node. | Objects. | Entry. |
| - Description of parts of interactors | No current representation of parts of molecules but a proposal exists for next release of SBML. | Protein sequences and sites can be described. | No description of parts of molecules (possible to add in user defined attributes). | Detailed description of reacting substructure. | No representation of parts of molecules. |
| Representation of interaction: | | | | | |
| - Used notation | Reaction. | Interaction. | Edge, can be further specified by subclasses. | Interaction pair. | Relation and reaction. |
| - Role of interactor | Each reaction allows interactors of three predefined roles: reactants, products or modifiers. | Each interactor can be given a role. | No roles. | No roles for interactors. | No roles for the relation, substrate and product used for reaction. |
| - No of interactors | Unbounded for each role. | Unbounded number of interactors. | Two interactors. | Two interactors. | Two interactors. |
| Other predefined entities: | | | | | |
| - Environment for reaction | Compartment defined as the environment for reactions. | It is possible to define compartments for reactions. | No environment for reaction. | No environment for reaction. | No environment for reaction. |
| - Experiment data | No data about experiments. | Data about experiments verifying the reaction. | No data about experiments. | Detailed data about experiments verifying the interaction. | No data about experiments. |
| - Math relations | Mathematical relations for reactions. | No math relations. | No math relations. | No math relations. | No math relations. |
| Expressiveness: | | | | | |
| - Main structure | All entities defined on top level. References between them indicate the structure of interactions. | Entities can be defined separately, but it is also possible to structure information around interactions. | All entities are defined on top level, references between them. | Information is structured around interaction. (Also possible to get the information structured around other types, i.e. objects.) | All entities are defined on top level, references between them. |
| - Inheritance | A hierarchy between the predefined entities but no possibility for the user to define types. | No inheritance. | Possible to add classes to nodes and edges but no inheritance. | No inheritance. | No inheritance. |
| - Definition of new attributes and entities | The note and annotation fields can be used for extra information. | A specific list of attributes can be used to add information that does not fit into the format. | It is possible to define attributes to nodes and edges. | No possibility to define new entity types. | No possibility to define new entity types. |
| Referencing to publications and databases | References to other sources only in the annotation field. | Links to publications and other databases. | Links to databases, but references to publications only allowed in new attributes. | Links to publications and other databases. | Links to other databases but not to publications. |

Besides nodes and edges there is also other information of interest to store about pathways. PSI MI and BIND have predefined tags for adding data about experiments verifying an interaction. They also have tags for referencing publications about experiments describing the interaction. SBML is the only format that contains tags for the mathematical relations needed for mathematical descriptions and simulations. It also includes information about where the reaction takes place, i.e. the compartment.

The main structure of all the formats is very similar, reflecting the structure in a pathway graph. The information is structured around some representation of the interacting subjects, the nodes in the pathway graph, and the interactions themselves, the edges in the graph. There are, however, also differences. In most of the formats the different parts of the information are defined separately, with cross-references used for representing actual connections and pathways. The only format that does not allow this is the BIND format, where information is always structured around the interaction. This makes the format more extensive, since the same information is often represented in several places in the file. The PSI MI format is the only format allowing both cross-references and inclusion of information into the interactions, so the user can choose the form that best suits the application.

None of the formats allow for user defined hierarchies.
However, SBML, PSI MI, and XIN all contain possibilities for the user to add information that does not fit into the predefined format. In SBML and PSI MI this is done by adding the data to the flexible note, annotation, or attribute fields, which the user can organise freely. XIN allows the user to define new attributes for nodes and edges.

All formats except SBML have possibilities to add references to other databases, thus creating links from the modelled subjects and reactions to other sources that give more information about them. This is very important, not only for finding more information but also because it gives a way of referencing and identifying the same entities between the different databases.

5. IMPLICATIONS FOR INTEGRATION AND DISCOVERY
The comparison of the formats above shows that the main structure, i.e. the description of interactors and interactions, is similar in the formats described. This means that it would be possible to translate and integrate data between the formats on a general level. The difference between the formats lies in the level of granularity of the details, especially when describing types of interactors, interactions and interactions on parts of the molecules. There is also a difference in what additional information is represented. Most of the formats have tags for describing links to publications, while only SBML provides tags for mathematical relations used for simulation. Information provided in these parts of the formats will, naturally, be harder to integrate into the other formats.

An interesting problem is that the terminology, for instance names of interactors, is not standardized between the formats. This will cause a severe problem in identifying the same objects from the different data sources.
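The identification problem can be illustrated with the DIP examples in Figures 2 and 3, where the same protein carries a descriptive full name in PSI MI and the name BAXA_HUMAN in XIN. The following minimal sketch, in Python with the standard xml.etree.ElementTree module, compares the two records on their database cross-references instead of their names. The fragments are shortened, hypothetical copies of the figures, and the "db:id" key convention and the function names psi_mi_keys and xin_keys are assumptions made for this illustration, not part of either format.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical copy of the interactor in Figure 2.
PSI_MI_INTERACTOR = """
<proteinInteractor id="G_1">
  <names><fullName>bcl-2-associated protein x, alpha splice form</fullName></names>
  <xref>
    <primaryRef db="DIP" id="232N"/>
    <secondaryRef db="SWP" id="Q07812"/>
  </xref>
</proteinInteractor>
"""

# Shortened, hypothetical copy of the node in Figure 3.
XIN_NODE = """
<node uid="DIP:232N" id="G:1" name="BAXA_HUMAN" class="protein">
  <feature name="SWP:Q07812" class="cref"><src>SwissProt</src></feature>
</node>
"""


def psi_mi_keys(text):
    """Collect 'db:id' keys from the primary and secondary references."""
    interactor = ET.fromstring(text)
    return {f"{ref.get('db')}:{ref.get('id')}" for ref in interactor.find("xref")}


def xin_keys(text):
    """Collect the node uid and the names of its cross-reference features."""
    node = ET.fromstring(text)
    keys = {node.get("uid")}
    keys.update(feature.get("name")
                for feature in node.findall("feature")
                if feature.get("class") == "cref")
    return keys


# The records share no name, but they do share cross-reference keys.
shared = psi_mi_keys(PSI_MI_INTERACTOR) & xin_keys(XIN_NODE)

if __name__ == "__main__":
    print(sorted(shared))
```

A non-empty intersection suggests that the two records describe the same protein. The sketch only works here because both exports happen to abbreviate the source databases in the same way; in general a mapping between database labels would be needed.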
To some extent this problem can be solved where the formats provide links to, and identifiers used in, common sources such as Swissprot [7] or PDB [3].

For search and discovery it is also interesting to consider the possibility of querying over more than one format of data. In this case the same possibilities and restrictions as for integration hold: querying over the main structure would be easily achieved, while the variations in granularity and additional information would cause problems. The difference in terminology would cause problems here as well.

All this gives some interesting implications for future work. One line of work would be to investigate how existing tools for working with XML, such as query languages and database generators, can be used for achieving integrated databases and discovery tools for data provided in the above formats. Here it would also be interesting to investigate limitations of the formats and the scalability of XML. Another interesting line of work would be to investigate whether work on ontologies and ontology integration [22] can be used to overcome the terminology problem when integrating data of various formats.

6. CONCLUSION AND DISCUSSION
In this paper we have investigated and compared XML-based formats for representation of pathway data. The formats compared are the two proposed standards SBML and PSI MI, and three formats proposed as exchange formats for particular databases. The comparison shows that the main structure of the formats is similar, but that the level of detail of descriptions and the information content differ.

The comparison also shows that the two formats proposed as standards are the ones allowing for the most flexible and general representation of molecule interaction. Of these two, SBML is currently more tailored towards applications where simulation and modelling are needed, while PSI MI is preferable if representation of performed experiments is of importance.
Of the three database formats, XIN is the most general and flexible, since it allows users to define their own attributes, while the other two are more tailored to their specific database applications.

The most interesting extension needed for the standards is the possibility of representing molecular structure. Here there are ongoing proposals both for SBML and within the BioPAX consortium. Another interesting extension would be the standardisation of terminology, as provided by, for instance, Gene Ontology. Another very interesting feature would be user-defined entities, since this would give more flexibility to the definitions.

7. ACKNOWLEDGMENTS

The author is grateful to Patrick Lambrix and Vaida Jakoniené for valuable comments on this work. We would also like to thank the maintainers of the databases for allowing access to their data.

8. REFERENCES

[1] Achard, F, Vaysseix, G, and Barillot E. XML, bioinformatics and data integration. Bioinformatics 17(2):115-125 (2001).
[2] Bader, G. D, Donaldson, I, Wolting, C. et al. BIND – The Biomolecular Interaction Network Database. Nucleic Acids Research 29(1) (2001).
[3] Berman H. M, Westbrook J, Feng Z. et al. The Protein Data Bank. Nucleic Acids Research 28:235-242 (2000).
[4] BioCyc data collection. http://biocyc.org (Accessed May 2004)
[5] The BioPAX Consortium. BioPAX Ontology Class structure. September 2003. www.biopax.org (Accessed April 2004)
[6] The BioPAX Consortium. The BioPAX data exchange format v 0.5 (draft release). September 2003. www.biopax.org (Accessed April 2004)
[7] Boeckmann B, Bairoch A, Apweiler R. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 31:365-370 (2003).
[8] Brazma, A. Editorial: On the Importance of Standardisation in Life Sciences. Bioinformatics 17(2):113-114 (2001).
[9] Database of Interacting Proteins. http://dip.doe-mbi.ucla.edu (Accessed May 2004)
[10] Davidson S. B, Tannen V, Crabtree J. et al.
K2/Kleisli and GUS: Experiments in integrated access to genomic data sources. IBM Systems Journal 40(2):512-531 (2001).
[11] Finney, A. Systems Biology Markup Language (SBML) Level 3: Proposal: Multi-component Species Features. Proposal manuscript. March 2004. Available at http://www.cds.caltech.edu/~afinney/multi-componentspecies.pdf (Accessed April 2004)
[12] Finney, A. and Hucka M. Systems Biology Markup Language (SBML) Level 2: Structures and Facilities for Model Definitions. June 28 2003. Available at http://sbml.org/documents (Accessed April 2004)
[13] Hermjakob H, Montecchi-Palazzi L, Bader G. et al. The HUPO PSI’s Molecular Interaction format – a community standard for the representation of protein interaction data. Nature Biotechnology 22(2):177-183 (2004).
[14] Gene Ontology Consortium. www.geneontology.org
[15] Hogue, C. Biomolecular Interaction Networks Database. www.blueprint.org/bind/bind.php (Accessed April 2004)
[16] Hucka, M, Finney A, Sauro H. M. et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19(4):524-531 (2003).
[17] Kanehisa, M. and Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28:27-30 (2000).
[18] Krishnamurthy, L, Nadeau J, Ozsoyoglu G. et al. Pathways Database System: An integrated set of tools for biological pathways. Proc. The Eighteenth Annual ACM Symposium on Applied Computing, Melbourne, Florida, USA (2003).
[19] Kyoto University Bioinformatics Centre. KEGG pathway database. http://www.genome.ad.jp/kegg/pathway.html (Accessed May 2004)
[20] Kyoto University Bioinformatics Centre. KEGG Markup Language Manual. Specification available through http://www.genome.ad.jp/kegg/docs/xml/ (Accessed May 2004)
[21] Lambrix, P, Jakoniené, V. Towards transparent access to multiple biological databanks. Chen P. (Eds), Proceedings of the First Asia-Pacific Bioinformatics Conference, 53-60, Adelaide, Australia (2003). Publ.
Australian Computer Society, ISBN 0-909925-97-6.
[22] Lambrix, P, Tan, H. Merging DAML+OIL Ontologies. Barzdins (Eds), Proceedings of the Sixth International Baltic Conference on Databases and Information Systems, Riga, Latvia, 2004. Publ. Latvijas Universitāte, ISBN 9984-770-11-7.
[23] McEntire R, Karp P, Abernethy N. et al. An evaluation of Ontology Exchange Languages for Bioinformatics. Proc International Conference of Intelligent Systems for Molecular Biology 8:239-250 (2000). Publ. Amer Assn for Artificial, ISBN 1577351150.
[24] The Molecular INTeraction database. http://mint.bio.uniroma2.it/mint/ (Accessed May 2004)
[25] Proteomics Standards Initiative Molecular Interaction XML Format Documentation. Version 1.0. 2002. http://psidev.sourceforge.net/mi/xml/user/ (Accessed April 2004)
[26] Salwinski L, Miller CS, Smith AJ. et al. The Database of Interacting Proteins: 2004 Update. Nucleic Acids Research 32 Database Issue D449-451 (2004).
[27] The Signalling Pathway Database. http://www.grt.kyushuu.ac.jp/spad/ (Accessed April 2004)
[28] Sirava, M, Schäfer T, Eiglsperger M. et al. BioMiner – modelling, analyzing, and visualizing biochemical pathways and networks. Bioinformatics 18(Suppl. 2):S219-S230 (2002).
[29] Takai-Igarashi, T and Kaminuma T. Cell Signaling Networks Database. http://geo.nihs.go.jp/csndb (Accessed April 2004)
[30] W3C. The Extensible Markup Language. http://www.w3.org/XML
[31] Zanzoni, A, Montecchi-Palazzi L, Quondam M. et al. MINT: a Molecular INTeraction database. FEBS Letters 513:135-140. http://www.elsevier.nl/febs/229/29/38/index.htt