INVESTIGATING THE PROMISE OF A PORTABLE BIOLOGICAL MODEL

Jason Smale

ABSTRACT - The use of computational tools to model and analyse research has become regular practice for today’s systems biologist. Unfortunately, this use of technology has its own drawbacks, which can be very frustrating to the budding researcher. One of these frustrations is the difficulty of transporting a biological model developed in one computational tool to another. This paper discusses the problems that arise when biological models are not portable across different computational tools. The author then investigates how two existing frameworks attempt to structure the data contained in a model to allow for portability between tools. Recommendations are then provided for the development of the two languages in the hope of creating even greater portability.

I. INTRODUCTION

The recent explosion of interest in studying biological networks at a systems level has resulted in the development of a variety of computational tools used to model and analyze the systems biologist’s research. Systems biology research is performed in many different areas of biology all around the world; consequently, a very diverse mix of computational tools has been created. The diversity of these software tools has led to numerous problems. For example, simulation models created by one research group cannot be easily compared or shared with another research group, because the tools developed by these groups are incompatible with each other.
To obtain the complementary resources of the other group, the researcher must manually re-encode the model into their own tool, which is a time-consuming and error-prone process.[1] In addition, models published in journals often include instructions for obtaining the model definitions, but if the author used a different software tool as their modeling environment, examining the model definitions in the reader’s modeling environment will prove very difficult.[1] As the field of systems biology matures and biological models become more complex, these problems will become more damaging, since researchers will increasingly need to communicate their complex results as computational models instead of ‘simple’ block diagrams. In light of these problems, researchers realized there was a great need for a common modeling framework that would enhance a model’s portability from one computational tool to another. Both the Systems Biology Markup Language and the Cell Markup Language were designed to be such frameworks, allowing greater interaction between computational tools and thus enabling users to spend more time on their research instead of struggling with model data formatting issues.

II. SYSTEMS BIOLOGY MARKUP LANGUAGE

The development of the Systems Biology Markup Language (SBML) was a project headed by the ERATO Systems Biology Workbench Development Group. SBML was designed to facilitate the exchange of information between analysis packages regarding biochemical reactions, cell signalling pathways, metabolic pathways and many other topics commonly found in the systems biology research field.

A. Approach to its design

Simply put, SBML is a framework that uses eXtensible Markup Language (XML) as its language base to accurately structure, store and allow for easy transmission of data to other applications.
XML was chosen as the SBML language base because it is a software- and hardware-independent language used to structure, store and easily send data over the internet.[2] Even at the onset of the SBML project, XML had already obtained widespread acceptance as the standard data language for bioinformatics.[1] SBML Level 2 was created in conjunction with the authors of the following systems: BASIS, Bio Sketch Pad, BioSpreadsheet, BioSpice, CellDesigner, Cellerator, COPASI, DBsolve, E-CELL, ESS, Gepasi, Jarnac, JDesigner, JigCell, MCell, NetBuilder, PathScout, ProMoT/DIVA, StochSim, and Virtual Cell.[1] The input from these authors was of great importance because the data encapsulated in an SBML model was intended to be used in each of these computational tools. It is important to note that the definition of SBML does not specify how each application will read or write the data encapsulated in SBML. Each computational tool is responsible for translating its internal data structures to and from SBML. SBML simply ensures the data is structured in such a way that any program listed above is able to interpret it.

B. Overview of the Components used in SBML

This section summarizes information found in the SBML Level 2 Specification, highlighting the major components of SBML, so the reader can gain a general sense of its capabilities. For a more detailed description of each component, please refer to the SBML Level 2 specification [3]. In abstract terms, a chemical reaction takes place inside a specific volume or compartment and involves species (reactants and products), rate laws, and the parameters in those rate laws. An SBML model therefore simply comprises lists of the following components; its architecture is shown in Fig. 1.

Function definitions - This component allows functions to be called by a meaningful declared name throughout the rest of the model.
The function definition is also used to interface any general mathematical functions that are to be used in the model with the Mathematical Markup Language (MathML), which is complementary to SBML and allows the implementation of a variety of mathematical operations.

Unit definitions - Special expressions of quantities used throughout the model are defined in this component. An example would be the declaration of mL/s or nΩ/°C. These units are constructed by combining standard units found in the SBML unit database. The formula for a user-defined unit is as follows:

$$\text{unit}_{\text{new}} = \left(\text{scalar} \times 10^{\text{scale}} \times \text{unit}_{\text{base}}\right)^{\text{exponent}} + \text{offset}$$

The scalar multiplier may be used, for example, to convert inches to millimeters. The factor $10^{\text{scale}}$ may be used to convert meters to millimeters. The exponent may be used to represent a volume or area, and finally the offset may be used to represent degrees Fahrenheit instead of Celsius.

Compartment definitions - A compartment in SBML represents a bounded volume in which species are located. If no compartment is defined, the model is assumed to be located within a single unit volume. The compartment declaration allows the user to assign a name, units, and a total volume to the compartment. The user also has the option to declare a compartment as simply surrounding another compartment of a specific size. This type of volume allocation would be required when referring to the volume of the nucleus and the cell’s cytoplasm surrounding the nucleus.

Species definitions - The species element in SBML is used to represent entities such as ions and molecules that participate in the reaction. Examples range from glucose to RNA. When declaring a species element, the user must define its name and the compartment in which it is present. In addition, the user has the option to declare its initial amount or concentration, its units, its charge (if it is an ion), and whether the species concentration or amount is fixed or dynamic throughout the simulation. If the species is fixed, this implies there is an external mechanism or source that maintains a constant quantity of the species in the compartment throughout the course of the reaction.

Parameter definitions - This component declaration associates a name with a quantity, making the code more readable since the programmer refers to meaningful names instead of values. SBML allows the declaration of global parameters (usable in any reaction in the model) or local parameters (usable in only one reaction). This declaration area covers global parameters only. Local parameters are defined in the reaction in which they are involved.

Reaction definitions - The reaction component is a list of the reactant and product species that are involved in the modeled reaction. A kineticLaw element provides an expression describing the rate at which the reactants combine to form the products. The kineticLaw element uses a MathML expression to accurately model the rate. In the reaction component the user can also declare whether the reaction is a ‘fast’ reaction, which is relevant when computing equilibrium concentrations of rapid reactions.

List of rules (optional) - A rule is used to establish mathematical constraints on parameters in the specific cases where the constraints cannot be included in the reaction’s rate expression. An example would be adding saturation limits to a specific parameter.

List of events (optional) - An event is a statement describing an instantaneous or discontinuous change to any component (parameter value, compartment size, species concentration), which can be triggered in response to a specific state of the model.

Fig. 1 - Pictorial Representation of the SBML Framework

SBML (Level 2 only) also allows the user to make use of the metadata attribute. Simply put, metadata is “data about data”, and provides information to facilitate searches of collections of models.
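As a rough illustration of the component lists described in this section, the sketch below embeds a small SBML Level 2 style fragment in Python and lists the identifiers found in each component list. It is not taken from the specification: the model, its ids (toy_model, cytoplasm, glucose, g6p, k1, phosphorylation), the metaid value and the helper summarize are all hypothetical, and the fragment has not been validated against the SBML schema.

```python
import xml.etree.ElementTree as ET

# Illustrative SBML Level 2 style fragment. All ids and values are
# hypothetical and the fragment is not schema-validated.
SBML_TEXT = """
<sbml xmlns="http://www.sbml.org/sbml/level2" level="2" version="1">
  <model id="toy_model" metaid="meta_toy_model">
    <listOfCompartments>
      <compartment id="cytoplasm" size="1.0"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glucose" compartment="cytoplasm" initialConcentration="5.0"/>
      <species id="g6p" compartment="cytoplasm" initialConcentration="0.0"/>
    </listOfSpecies>
    <listOfParameters>
      <parameter id="k1" value="0.1"/>
    </listOfParameters>
    <listOfReactions>
      <reaction id="phosphorylation" reversible="false">
        <listOfReactants><speciesReference species="glucose"/></listOfReactants>
        <listOfProducts><speciesReference species="g6p"/></listOfProducts>
        <kineticLaw>
          <math xmlns="http://www.w3.org/1998/Math/MathML">
            <apply><times/><ci>k1</ci><ci>glucose</ci></apply>
          </math>
        </kineticLaw>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

NS = {"sbml": "http://www.sbml.org/sbml/level2"}


def summarize(sbml_text):
    """Return {list name: [ids]} for each component list in the model."""
    root = ET.fromstring(sbml_text)
    model = root.find("sbml:model", NS)
    summary = {}
    for child in model:
        # ElementTree stores tags as "{namespace}name"; keep only the name.
        list_name = child.tag.split("}", 1)[1]
        summary[list_name] = [item.get("id") for item in child]
    return summary


summary = summarize(SBML_TEXT)
for name, ids in summary.items():
    print(name, ids)
```

The point of the sketch is only that a reading tool needs no knowledge of the tool that wrote the file: the lists of compartments, species, parameters and reactions are found by name, and each tool maps them onto its own internal data structures.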
The user includes descriptive information about the model in the metaid field of the model’s preliminary definition, which can help other modellers determine whether it would be worthwhile to use the model in their own research.

III. CELL MARKUP LANGUAGE

The development of the Cell Markup Language (CellML) was a project headed by the BioEngineering Institute at the University of Auckland, New Zealand. Simply stated, the purpose of CellML is to structure, store and exchange biological model data between multiple applications used for simulation and analysis.

A. Approach to its design

Similar to SBML, CellML is a framework that uses XML as its language base to accurately structure, store and allow for easy transmission of data to other applications. CellML is designed to support models of cellular and subcellular processes. The scope of the CellML language is specifically limited to how the parts of a model are related to one another, both physically and logically. CellML employs MathML and RDF to represent, respectively, the mathematical equations describing biological processes and metadata. For more information regarding MathML and RDF, please refer to the CellML website; they fall outside the scope of CellML itself and will not be discussed in this document.[4]

B. Overview of the Components used in CellML

This section summarizes information found in the CellML Version 1.0 Specification, highlighting the major components of CellML, so the reader can gain a general sense of its capabilities. For a more detailed description of each component, please refer to the CellML Version 1.0 Specification [5]. CellML’s designers decided that any model could be described as a network of connections between self-contained components. This design concept is responsible for CellML’s flat model architecture. In fact, a CellML model contains only four element definition types: units, components, groups, and connections.
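The four element types named above can be sketched with a small CellML 1.0 style fragment, again embedded in Python so that it can be parsed and inspected. This is a minimal illustration under stated assumptions rather than an excerpt from the specification: the model and component names (toy_membrane, membrane, sodium_channel), the variable V and the helper element_types are hypothetical, and the fragment has not been validated against the CellML schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CELLML_NS = "http://www.cellml.org/cellml/1.0#"

# Illustrative CellML 1.0 style fragment. All names and values are
# hypothetical and the fragment is not schema-validated.
CELLML_TEXT = """
<model name="toy_membrane" xmlns="http://www.cellml.org/cellml/1.0#">
  <units name="millivolt">
    <unit prefix="milli" units="volt"/>
  </units>
  <component name="membrane">
    <variable name="V" units="millivolt" initial_value="-75"
              private_interface="out"/>
  </component>
  <component name="sodium_channel">
    <variable name="V" units="millivolt" public_interface="in"/>
  </component>
  <group>
    <relationship_ref relationship="encapsulation"/>
    <component_ref component="membrane">
      <component_ref component="sodium_channel"/>
    </component_ref>
  </group>
  <connection>
    <map_components component_1="membrane" component_2="sodium_channel"/>
    <map_variables variable_1="V" variable_2="V"/>
  </connection>
</model>
"""


def element_types(cellml_text):
    """Count the top-level element types that appear directly under <model>."""
    root = ET.fromstring(cellml_text)
    prefix = "{" + CELLML_NS + "}"
    # Drop the "{namespace}" prefix ElementTree adds to every tag name.
    return Counter(child.tag[len(prefix):] for child in root)


counts = element_types(CELLML_TEXT)
print(dict(counts))
```

Every child of the model element in this fragment is one of the four types; the variables sit one layer lower, inside their components.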
CellML is known as a flat architecture because there are only three elemental layers in a CellML model. The top layer is simply the model. The model is divided into components. These components are then further divided into variables, as shown in Fig. 2. In CellML, a hierarchy refers to the relationship among all the components in the model. The modelled components form a “parent-child tree” with respect to a specific grouping scheme. There are two types of grouping scheme in CellML: encapsulation and containment. Encapsulation refers to components grouped in a logical hierarchy, whereas containment refers to a physical hierarchy; for example, the nucleus would be a child component of the cell. Using encapsulation grouping, models become more structured, since the only exchange of information is between parent and child components or between sibling components. This is similar to the efficient and structured way in which the internet passes its information. When a client wishes to receive some information, it sends a request to a router, which is the equivalent of a parent component in the CellML model. The router then relays the request to the appropriate server (child component), which returns the answer to the router, which in turn relays it back to the client. This architecture is especially effective if a modeller wishes to re-use an encapsulated sub-model: they would simply treat the sub-model as a black box that only has to interface with the parent component. An encapsulated hierarchical relationship is shown in Fig. 2.

Fig. 2 - Pictorial Representation of the CellML Framework

As mentioned before, a component contains variables that have a biological significance in the model. When defining a variable, the user has the option of declaring its initial value and its connection properties.
Connection properties determine which other components (sibling or parent) can access or even alter the variable’s present value. The component may also contain equations that modify the values of these variables. These equations may be represented quantitatively using MathML mark-up, or can be declared in the component’s reaction element, which is used to express a reaction qualitatively. The reaction element is also used when MathML cannot accurately model the reaction and additional rules need to be implemented to better describe it. For example, it is recommended that the user use the reaction element instead of MathML when the stoichiometry of an equation must be considered. In the reaction element, each variable reference must specify that variable’s role in the reaction (reactant, product, catalyst, etc.). Also in the component element of the model, the user has the option to define metadata regarding the component. This differs from SBML, where metadata was used as a reference for the overall model. A model built from many components encourages the re-use of components. For example, if a researcher is modeling a cell’s membrane, all of the equations and the variables’ connectivity could be defined in one Na+ channel component, which is then simply connected to the proper parent and sibling components to create an accurate model. CellML’s unit definitions are very similar to SBML’s, but the scale attribute is replaced by the prefix, which is entered as a string in CellML. To illustrate the use of the prefix string, one would enter ‘centi’ into the code instead of $10^{-2}$ to obtain the unit cm. The user-defined unit equation is shown below:

$$\text{unit}_{\text{new}} = \left(\text{multiplier} \times \text{prefix} \times \text{unit}_{\text{base}}\right)^{\text{exponent}} + \text{offset}$$

IV. DIFFERENCES BETWEEN THE TWO MARKUP LANGUAGES

CellML’s framework architecture is very different from SBML’s architecture.
Generally speaking, SBML’s use of compartments is equivalent to using a containment hierarchy in CellML, but SBML has no way to logically organize its species and compartments in a manner similar to CellML’s encapsulation hierarchy. Because SBML lacks an encapsulation hierarchy, it becomes very difficult to structure a very large model with many different species and reactions in one compartment. For example, when modelling a cell membrane with various ionic channels, enzyme receptors, molecule carriers and a variety of other species, there is no way to check whether a Na+ ion has mistakenly been coded to influence a K+ channel other than searching through hundreds or possibly thousands of lines of code. The encapsulation hierarchy eliminates this possibility because the modeller would make the Na+ channel the parent and the Na+ ion the child. Also, CellML allows the modeller to re-use a component, whereas in SBML all species must be unique. As a result, if you wish to have 100 Na+ channels in the membrane, in SBML you must have 100 different sodium channel names, whereas in CellML you need only one, making troubleshooting and overall code organization much simpler. Another difference between the two markup languages is SBML’s ability to declare a reaction as a ‘fast’ reaction. If a model were converted from SBML to CellML, this Boolean flag would be ignored and the simulation package would have to use an ODE to calculate the reaction’s concentrations. Because of these differences, the research community has generally accepted that SBML is used to structure models that describe biological pathways, whereas CellML has a more flexible structure, allowing it to describe any biological system mathematically.[6] In fact, CellML is being developed alongside two other markup languages that are under development at the University of Auckland.
These two languages, AnatML and FieldML, are intended, respectively, to structure models of entire organs and to structure and store information pertaining to the physical distribution of parameters for each of the components used in CellML. The goal is to integrate these markup languages to provide a complete framework that would accurately describe models at all scales, from sub-cellular reactions to an organ. Although this goal is ambitious, the researchers at the University of Auckland consider it viable.[4]

V. THE ROAD AHEAD

A. CELLML’S AND SBML’S PAST HELPS THEIR FUTURE

If both CellML and SBML are to become comprehensive model definition languages, their developers must share their ideas for two main reasons. The first reason is that the other language’s developers are a great resource for ideas and improvements that could be implemented in one’s own language, making the modelling framework capable of handling more complex models. Thankfully, communication between the SBML and CellML developers has been very open and was the catalyst for SBML introducing metadata and MathML in its second version (Level 2).[7] The second reason is that, as the languages grow more complex in order to describe more complex models, it is extremely important that a model described in one language can still be translated to the other.[7] An example of this need arises because Virtual Cell uses CellML as its model definition language and DARPA BioSPICE uses SBML. If the Virtual Cell developers were interested in a model that DARPA had developed, their program would have to be able to translate the SBML framework to the CellML framework easily, with minimal loss of data and model description. This second reason is, after all, the underlying reason the model definition languages are needed in the first place.

B. IMPROVEMENTS FOR BOTH LANGUAGES

Not only is it important for the developers to continue to share existing functionality, it is also important that they share ideas and knowledge regarding their future development plans. This matters because, as each language’s framework improves, the developers encounter common problems. In the author’s opinion, the most notable are the present inability to accurately and easily model a stochastic biological process, and the integration of the ontology of biochemical reactions presently being defined by researchers all over the world. The ontology problem is extremely important. Without a proper ontology, models filed away in databases may not be found by a scientist who needs them, because he or she used a different annotation to search for a particular element. This problem is eloquently presented in a paper by Mark Ettinger, which highlights how complications can arise from the use of different variable names and reaction sequences when describing exactly the same model.[8] Although metadata is a start toward overcoming this problem, it is not the complete answer. To properly standardize a model’s ontology, all parties (researchers, ontology markup language developers, SBML developers, CellML developers and database developers) must work together, and all must agree that the standard annotation can be implemented in their respective work. While producing a more comprehensive model definition language is important to the developers’ goal of creating a portable model framework, they must also ensure that the language remains easier to use than simply re-writing an existing model in the researcher’s own computational tool. A good example of an application that has developed into a large program capable of modelling a very diverse range of systems, while still maintaining its original simplicity, is MATLAB.
MATLAB was originally created to compute complex mathematical equations without the user having to code in FORTRAN, thereby making the user’s job easier. Today MATLAB can do anything from modelling digital communication systems to calculating a bottom line for a financial institution. This is done through the use of toolboxes, which help MATLAB maintain its user-friendliness. Regardless of what MATLAB is being used for, it is still considered very user-friendly relative to the alternatives. The author is not suggesting that CellML and SBML should introduce toolboxes into their frameworks, but instead stresses the importance of maintaining the frameworks’ initial simplicity when adding options that make the model language more comprehensive. Another important step that SBML and CellML must take to achieve their goal of portability is to promote the use of their languages in the scientific community. So far they have done this by giving presentations and distributing documentation all over the world (on the web, at conferences, etc.). They have also provided example models on their home websites and ensured that their websites are informative. Unfortunately, there still seem to be many researchers who are unsure exactly what benefits the CellML and SBML languages offer, and how to code the specific model they are presently researching.[9] It is therefore suggested that, in addition to improving the languages’ frameworks, additional emphasis should be placed on promoting SBML and CellML and their ability to solve current model portability problems.

VI. CONCLUSION

SBML and CellML are designed to structure data in a way that allows easy portability from one computational tool to another. Although their respective architectures are different, both languages perform the designated task to a certain degree.
Because of their different data structures, research groups studying biological pathways tend to prefer SBML, whereas CellML is usually applied to a more diverse range of models. If both languages are to continue improving the portability of models, it is important that their developers freely share information with each other. This can lead to each language implementing and improving on features developed by the other, and to shared resources for developing new features. These features should include the use of a standard ontology. When the designers make any functional change to their respective language, they must ensure that using the language remains much simpler than rewriting an existing model’s code. A failure to maintain this simplicity should be considered a failure in the design. In addition to making functional improvements, the language creators must also concentrate on promoting the use of their languages and the benefits they offer to the systems biology research community.

REFERENCES

[1] M. Hucka et al., “The Systems Biology Markup Language (SBML): a medium for representation and exchange of biochemical network models”, Bioinformatics, vol. 19, no. 4, 2003, pp. 524-531.
[2] W3C Architecture Domain, “Extensible Markup Language (XML)”, August 2003, http://www.w3.org/XML/
[3] Finney et al., “Systems Biology Markup Language (SBML) Level 2: Structures and Facilities for Model Definitions”, June 2003, [Online]. Available: http://aleron.dl.sourceforge.net/sourceforge/sbml/sbml-level2-v1.pdf
[4] University of Auckland, “CellML.org – What is CellML?”, July 2003, http://www.cellml.org/public/about/what_is_cellml.html
[5] W. Hedley, “CellML Specification”, August 2001, [Online]. Available: http://www.cellml.org/public/specification/20010810/cellml_specification.html
[6] W. Hedley, “SBML to CellML translation”, Presented June 2001, [Online]. Available: http://www.sbwsbml.org/workshops/third/warren/200106_sysbio_caltech_sbml_cellml_translation.pdf
[7] W. Hedley, “Meeting Minutes 29 January 2001 – Meeting with the SBML Team at Caltech”, January 2001, [Online]. Available: http://www.cellml.org/private/progress_reports/20010129_meeting_minutes.html
[8] M. Ettinger, “The Complexity of Comparing Reaction Systems”, Bioinformatics, vol. 18, no. 3, 2002, pp. 465-469.
[9] Unknown – Discussion Board, “Portal – Summary-2002-07-01”, [Online]. Available: http://portal.bioengineering.elyt.ods.org/matt/cellml/VariousThoughts/Summary-2002-07-21/view