INVESTIGATING THE PROMISE OF A
PORTABLE BIOLOGICAL MODEL
Jason Smale
ABSTRACT –
The use of computational tools to model and analyse
research has become regular practice for today’s systems
biologist. Unfortunately, this use of technology has its own
drawbacks, which can be very frustrating to the budding
researcher. One such frustration is the difficulty of
transporting a biological model developed in one
computational tool to another. This paper discusses the
problems presented
when biological models are not portable across different
computational tools. The author then investigates how
two existing frameworks attempt to structure the data
contained in a model to allow for portability between
tools. Recommendations are then provided for the
development of the two languages in hopes of creating
even greater portability.
I. INTRODUCTION
The recent explosion of interest in studying biological
networks at a systems level has resulted in the development of
a variety of computational tools used to model and analyze
the systems biologist’s research. Systems biology research is
performed in many different areas of biology all around the
world; consequently, a very diverse mix of computational
tools has been created.
The diversity of these software tools has led to numerous
problems. For example, simulation models created by one
research group cannot be easily compared or shared with
another research group, because the tools developed by these
groups are incompatible with each other. To obtain the
complementary resources of the other group, the researcher
must manually re-encode the model into their tool, which is a
time-consuming and error-prone process.[1] In addition,
models published in journals often include instructions for
obtaining the model definitions, but if the author uses a
different software tool as their modeling environment,
examining the model definitions in the reader’s modeling
environment will prove to be very difficult.[1] As the field of
systems biology matures, and the biological models become
more complex, these problems will become more detrimental,
since researchers will increasingly need to communicate their
complex results as computational models instead of ‘simple’
block diagrams.
In light of these problems, researchers realized there was a
great need for a dedicated modeling framework that would
enhance a model’s portability from one computational tool to
another. Both the Systems Biology Markup Language and the
Cell Markup Language were designed to be such frameworks,
allowing greater interaction between computational tools and
thus enabling users to spend more time on their research
instead of struggling with model data formatting issues.
II. SYSTEMS BIOLOGY MARKUP LANGUAGE
The development of the Systems Biology Markup Language
(SBML) was a project headed by the ERATO Systems
Biology Workbench Development Group. SBML was
designed to facilitate the exchange of information between
analysis packages regarding biochemical reactions, cell
signalling pathways, metabolic pathways and many other
topics commonly found in the systems biology research field.
A. Approach to its design
Simply put, SBML is a framework that uses the eXtensible
Markup Language (XML) as its language base to accurately
structure, store and allow for easy transmission of data to
other applications. XML was chosen as the SBML language
base because it is a software- and hardware-independent
language used to structure, store and easily send data over
the internet.[2] Even at the onset of the SBML project, XML
had already obtained widespread acceptance as the standard
data language for bioinformatics.[1]
SBML Level 2 was created in conjunction with the authors of
the following systems: BASIS, Bio Sketch Pad,
BioSpreadsheet, BioSpice, CellDesigner, Cellerator,
COPASI, DBsolve, E-CELL, ESS, Gepasi, Jarnac, JDesigner,
JigCell, MCell, NetBuilder, PathScout, ProMoT/DIVA,
StochSim, and Virtual Cell.[1] The input from these authors
was of great importance because the data encapsulated in an
SBML model was intended to be used in each of these
computational tools. It is important to note that the definition
of the SBML model does not specify how each application
will read or write the data encapsulated in SBML. Each
computational tool is responsible for translating its internal
data structures to and from SBML. SBML simply ensures the
data is structured in such a way that any program listed above
is able to interpret it.
B. Overview of the Components used in SBML
This section summarizes information found in the SBML
Level 2 Specification, highlighting the major components of
SBML, so the reader can gain a general sense of its
capabilities. For a more detailed description of each
component, please refer to the SBML Level 2 Specification
[3].
In abstract terms, a chemical reaction takes place inside a
specific volume or compartment and involves species
(reactants and products), rate laws, and the parameters used
in those rate laws. An SBML model therefore simply comprises
lists of the following components, and its architecture is
demonstrated in Fig. 1.
Function definitions – This component allows functions to be
called by a meaningful declared name throughout the rest of
the model. The function definition is also used to interface
any general mathematical functions that are to be used in the
model with the Mathematical Markup Language (MathML),
which is complementary to SBML and allows the
implementation of a variety of mathematical operations.
Unit definitions – Special expressions of quantities used
throughout the model are defined in this component. An
example would be the declaration of mL/s or nΩ/°C. These
units are constructed by combining standard units found in the
SBML unit database. The formula to define a user-defined
unit is as follows:
$$\mathrm{unit}_{new} = \left(\mathrm{scalar} \times 10^{\mathrm{scale}} \times \mathrm{unit}_{base}^{\,\mathrm{exponent}}\right) + \mathrm{offset}$$
The scalar multiplier may be used to convert inches to
millimeters. The factor $10^{\mathrm{scale}}$ may be used to convert
meters to millimeters. The exponent may be used to represent
a volume or area, and finally the offset may be used in the
case of representing degrees Fahrenheit instead of Celsius.
Compartment definitions – A compartment in SBML
represents a bounded volume in which species are located. If
there is no compartment defined, the model is assumed to be
located within a single unit volume. The compartment
declaration allows the user to assign a name, units, and a total
volume to the compartment. The user also has the option to
declare a compartment as simply surrounding another
compartment of a specific size. This type of volume
allocation would be required when referring to the volume of
the nucleus and the cell’s cytoplasm surrounding the nucleus.
Species definitions – The species element in SBML is used to
represent entities such as ions and molecules that participate
in the reaction. An example could include anything from
glucose to RNA. When declaring a species element, the user
must define its name and the compartment in which it is
present. In addition, the user has the option to declare its
initial amount or concentration, units, charge present (if it is
an ion), and whether the species concentration or amount is
fixed or dynamic throughout the simulation. If the species is
fixed, this implies there is an external mechanism or source
that maintains the constant quantity of the species in the
compartment throughout the course of the reaction.
Parameter definitions – This component declaration
associates a name with a quantity, making the code more
readable since the programmer is referring to meaningful
names instead of values. SBML allows the declaration of
global parameters (used in any reaction in the model) or local
parameters (used in only one reaction). Only global
parameters are declared in this list. Local
parameters are defined in the reaction in which they
are involved.
Reaction definitions – The reaction component is a list of
reactant and product species that are involved in the modeled
reaction. A kineticLaw element is used to provide a
formula describing the rate at which the reactants combine
to form the products. The kineticLaw element uses a
MathML expression to accurately model the rate. In the
reaction component the user also has the ability to declare
the reaction as a ‘fast’ reaction, which is relevant when
computing equilibrium concentrations of rapid reactions.
List of rules (optional) – A rule is used to establish
mathematical constraints on parameters for the specific cases
where the constraints cannot be included in the reaction
algorithm. An example would be adding saturation limits to a
specific parameter.
List of events (optional) – An event is a statement describing
an instantaneous or discontinuous change to any component
(parameter value, compartment size, species concentration)
which can be triggered in response to a specific state of the
model.
FIG. 1 – Pictorial Representation of the SBML Framework
SBML (Level 2 only) also allows the user to make use of the
metadata attribute. Simply put, metadata is “data about data”,
and provides information to facilitate searches of collections
of models. The user includes descriptive information about
the model in the preliminary definition metaid field, which
can help other modellers determine whether it would be
worthwhile using the model for their own research.
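To make the component lists described in this section more concrete, the following is a minimal sketch of an SBML-style document and a few lines of Python that read it. The model itself (toy_model, cytoplasm, glucose, g6p, k1, phosphorylation) is hypothetical, and the element and attribute names reflect one reading of SBML Level 2 rather than a quotation from the specification, so they should be checked against [3] before being relied upon.

```python
import xml.etree.ElementTree as ET

SBML_NS = "http://www.sbml.org/sbml/level2"   # assumed Level 2 namespace
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical one-reaction model: glucose -> g6p inside one compartment.
SBML_DOC = f"""
<sbml xmlns="{SBML_NS}" level="2" version="1">
  <model id="toy_model">
    <listOfCompartments>
      <compartment id="cytoplasm" size="1.0"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="glucose" compartment="cytoplasm" initialConcentration="5.0"/>
      <species id="g6p" compartment="cytoplasm" initialConcentration="0.0"/>
    </listOfSpecies>
    <listOfParameters>
      <parameter id="k1" value="0.1"/>
    </listOfParameters>
    <listOfReactions>
      <reaction id="phosphorylation" reversible="false" fast="false">
        <listOfReactants><speciesReference species="glucose"/></listOfReactants>
        <listOfProducts><speciesReference species="g6p"/></listOfProducts>
        <kineticLaw>
          <math xmlns="{MATHML_NS}">
            <apply><times/><ci>k1</ci><ci>glucose</ci></apply>
          </math>
        </kineticLaw>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""


def summarize(sbml_text):
    """Return the ids found in each top-level component list of the model."""
    model = ET.fromstring(sbml_text).find(f"{{{SBML_NS}}}model")
    summary = {}
    for component_list in model:
        # Tags look like "{namespace}listOfSpecies"; keep only the local name.
        list_name = component_list.tag.split("}")[1]
        summary[list_name] = [item.get("id") for item in component_list]
    return summary


summary = summarize(SBML_DOC)
for list_name, ids in summary.items():
    print(list_name, ids)
```

Because the document is plain XML, a receiving tool needs only a generic XML parser to recover the lists; how it maps them onto its own internal data structures is left to the tool.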
III. CELL MARKUP LANGUAGE
The development of the Cell Markup Language (CellML) was
a project headed by the BioEngineering Institute at the
University of Auckland, New Zealand. Simply stated, the
purpose of CellML is to structure, store and exchange
biological model data between multiple applications used for
simulation and analysis.
A. Approach to its design
Similar to SBML, CellML is a framework that uses
XML as its language base to accurately structure, store and
allow for easy transmission of data to other applications.
CellML is designed to support models of cellular and
subcellular processes.
The scope of the CellML language is specifically limited to
how the parts of a model are related to one another both
physically and logically. MathML and RDF are employed by
CellML to represent mathematical equations describing
biological processes and metadata, respectively. For more
information regarding MathML and RDF please refer to the
CellML website, since they fall outside of the CellML scope
and will not be discussed in this document.[4]
B. Overview of the Components used in CellML
This section summarizes information found in the CellML
Version 1.0 Specification, highlighting the major components
of CellML, so the reader can gain a general sense of its
capabilities. For a more detailed description of each
component, please refer to the CellML Version 1.0
Specification [5].
CellML designers decided that any model could be described
as a network of connections between self-contained
components. This design concept is responsible for CellML’s
flat model architecture. In fact, a CellML model contains
only four types of element definitions: units, components,
groups, and connections.
CellML is said to have a flat architecture because there are
only three elemental layers in a CellML model. The top layer
is simply the model. The model is divided into
components, and these components are further divided
into variables, as demonstrated in Fig. 2.
In CellML, a hierarchy refers to the relationship of all the
components in the model. The modelled components form a
“parent-child tree” with respect to a specific grouping scheme.
In CellML, there are two types of grouping schemes,
encapsulation and containment. Encapsulation refers to
components grouped in a logical hierarchy, whereas the
containment grouping refers to a physical hierarchy, for
example the nucleus would be a child component of the cell.
Using encapsulation grouping, models become more
structured since the only exchange of information is between
the parent and child components or between sibling
components. This is similar to the efficient and structured
way in which the internet passes information. When
a client wishes to receive some information, it sends a request
to a router, which is the equivalent of a parent component in
the CellML model. The router then relays the request to
the appropriate server (child component), which returns the
answer to the router, which in turn relays it back to the client.
This model architecture is especially effective if a modeller
wishes to re-use an encapsulated sub-model: the sub-model
can simply be treated as a black box that only has to
interface with the parent component. An encapsulated
hierarchical relationship is demonstrated in Fig. 2.
Fig. 2 - Pictorial Representation of the CellML Framework
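The encapsulation rule described above, that information is exchanged only between parent and child components or between siblings, can be sketched in a few lines of Python. The component names and the PARENT table are hypothetical, and treating two top-level components as siblings is an assumption of this sketch rather than a statement taken from the CellML specification.

```python
# Hypothetical encapsulation hierarchy: child component -> parent component.
PARENT = {
    "na_channel": "membrane",
    "k_channel": "membrane",
    "na_gate": "na_channel",
    "k_gate": "k_channel",
}


def may_connect(a, b, parent=PARENT):
    """True if a and b are parent and child, or siblings under one parent."""
    if a == b:
        return False
    parent_a, parent_b = parent.get(a), parent.get(b)
    if parent_a == b or parent_b == a:   # parent-child relationship
        return True
    # Siblings share a parent; two top-level components (no parent) are
    # also treated as siblings here, which is an assumption of the sketch.
    return parent_a == parent_b


print(may_connect("na_gate", "na_channel"))    # child and its parent
print(may_connect("na_channel", "k_channel"))  # siblings under the membrane
print(may_connect("na_gate", "k_channel"))     # neither relationship
```

A check of this kind is what allows a tool to reject an accidental link between a sub-component of one channel and an unrelated channel without a manual search through the model.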
As mentioned before, a component contains variables that
have a biological significance in the model. When defining a
variable, the user has the option of declaring its initial value
and its connection properties. Connection properties specify
which other components (sibling or parent) can access or
even alter the variable’s present value. The component may
also contain equations that modify the values of these
variables. These equations may be represented quantitatively
using MathML mark-up, or can be declared in the reaction
attribute of the component, which is used to express a
reaction qualitatively. The reaction attribute is also used when
MathML cannot accurately model the reaction and additional
rules need to be implemented to better describe it. For
example, it is recommended that the user use the reaction
attribute instead of MathML when the stoichiometry of an
equation must be considered. In the reaction attribute, each
variable reference must specify that variable’s role in the
reaction (reactant, product, catalyst, etc.). Also in the
component element of the model, the user has the option to
define metadata regarding the component. This is
different from SBML, where metadata is used as a reference
for the overall model.
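As an illustration of components, variables and their connection properties, the sketch below builds a small CellML-style document and lists the variable connections it declares. The model (toy_membrane, membrane, na_channel, V) is hypothetical, and the element and attribute names follow one reading of CellML 1.0; they are assumptions to be verified against the specification [5].

```python
import xml.etree.ElementTree as ET

CELLML_NS = "http://www.cellml.org/cellml/1.0#"   # assumed CellML 1.0 namespace

# Hypothetical model: the membrane exposes its voltage V to a sodium channel.
CELLML_DOC = f"""
<model name="toy_membrane" xmlns="{CELLML_NS}">
  <component name="membrane">
    <variable name="V" units="volt" initial_value="-0.075"
              public_interface="out"/>
  </component>
  <component name="na_channel">
    <variable name="V" units="volt" public_interface="in"/>
  </component>
  <connection>
    <map_components component_1="membrane" component_2="na_channel"/>
    <map_variables variable_1="V" variable_2="V"/>
  </connection>
</model>
"""


def list_connections(cellml_text):
    """Return (component.variable, component.variable) pairs per connection."""
    ns = {"c": CELLML_NS}
    model = ET.fromstring(cellml_text)
    pairs = []
    for connection in model.findall("c:connection", ns):
        components = connection.find("c:map_components", ns)
        for mapping in connection.findall("c:map_variables", ns):
            pairs.append((
                f'{components.get("component_1")}.{mapping.get("variable_1")}',
                f'{components.get("component_2")}.{mapping.get("variable_2")}',
            ))
    return pairs


print(list_connections(CELLML_DOC))
```

Here the interface attributes on each variable play the role of the connection properties described above: one component offers the value and the other receives it.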
A model built from many components encourages the re-use
of components. For example, if a researcher is modeling a
cell’s membrane, all of the equations and the variables’
connectivity could be defined in one Na+ channel component
and then simply connected to the proper parent and sibling
components to create an accurate model.
CellML’s unit definitions are very similar to the way SBML
units are defined, but the scale attribute is renamed as the
prefix, which is entered as a string in CellML. To better
illustrate the use of the prefix string, one would enter ‘centi’
into the code instead of $10^{-2}$ to obtain the unit cm. The
user-defined unit declaration equation is shown below:
$$\mathrm{unit}_{new} = \left(\mathrm{multiplier} \times \mathrm{prefix} \times \mathrm{unit}_{base}^{\,\mathrm{exponent}}\right) + \mathrm{offset}$$
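The correspondence between a CellML prefix string and an SBML integer scale can be sketched as a simple lookup. The table below holds only a subset of the SI prefixes, and the function names are hypothetical; the sketch does not cover the multiplier, exponent or offset terms.

```python
# Subset of SI prefixes and the power of ten each one stands for.
SI_PREFIX_TO_SCALE = {
    "kilo": 3,
    "deci": -1,
    "centi": -2,
    "milli": -3,
    "micro": -6,
    "nano": -9,
}
SCALE_TO_SI_PREFIX = {scale: name for name, scale in SI_PREFIX_TO_SCALE.items()}


def prefix_to_scale(prefix):
    """CellML-style prefix string -> SBML-style integer scale."""
    return SI_PREFIX_TO_SCALE[prefix]


def scale_to_prefix(scale):
    """SBML-style integer scale -> CellML-style prefix string."""
    return SCALE_TO_SI_PREFIX[scale]


print(prefix_to_scale("centi"))   # -2
print(scale_to_prefix(-3))        # milli
```

Since the two attributes carry the same information for these prefixes, a translator between the languages can convert this part of a unit definition in either direction without loss.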
IV. DIFFERENCES BETWEEN THE TWO MARKUP LANGUAGES
CellML’s framework architecture is very
different from SBML’s architecture. Generally speaking,
SBML’s use of compartments is the same as using a
containment hierarchy in CellML, but SBML does not have
the capability to logically organize its species/compartments
in a way that is similar to CellML’s encapsulation hierarchy.
Due to the absence of the encapsulation hierarchy in the
SBML language, it becomes very difficult to structure a very
large model with many different species and reactions in one
compartment. For example, when modelling a cell membrane
with various ionic channels, enzyme receptors, molecule
carriers and a variety of other species, there is no way to
check whether a Na+ ion has been coded to influence a K+
channel by mistake other than searching through the hundreds
or possibly thousands of lines of code. The encapsulation
hierarchy eliminates this possibility because the modeller
would make the Na+ channel the parent and the Na+ ion the
child.
Also, CellML allows the modeller to re-use a component,
whereas in SBML, all species must be unique. As a result, if
you wish to have 100 Na+ channels in the membrane, for
SBML you must have 100 different sodium channel names,
whereas in CellML, you only need one, making
troubleshooting and overall code organization much simpler.
Another difference between the two markup languages is
SBML’s ability to declare a reaction as a ‘fast’ reaction. If a
model were converted from SBML to CellML, this
Boolean flag would be ignored and the simulation package
would have to use an ODE to calculate the concentrations
involved in the reaction.
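The loss of the ‘fast’ flag can be illustrated with a small sketch. It assumes a hypothetical translator that carries over only the reaction attributes listed in CARRIED_OVER; that set, the function name and the example reaction are illustrative and are not taken from any existing SBML-to-CellML tool.

```python
import xml.etree.ElementTree as ET

# Assumption: attributes a hypothetical SBML-to-CellML translator carries over.
CARRIED_OVER = {"id", "name", "reversible"}


def dropped_attributes(reaction_xml):
    """Return the reaction attributes that would be lost in translation."""
    reaction = ET.fromstring(reaction_xml)
    return sorted(set(reaction.attrib) - CARRIED_OVER)


# Hypothetical SBML reaction flagged as fast.
REACTION = '<reaction id="binding" reversible="true" fast="true"/>'
print(dropped_attributes(REACTION))
```

Reporting dropped attributes in this way would at least warn the modeller that the translated model may be simulated differently from the original.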
Because of these differences, the research community has
generally accepted that SBML is used to structure models
which describe biological pathways, whereas CellML has a
more flexible structure, allowing it to describe any biological
system mathematically.[6] In fact, CellML is being
developed in coordination with two other markup languages
that are under development at the University of Auckland.
These two languages, AnatML and FieldML, are intended to
structure models of entire organs and to structure and store
information pertaining to the physical distribution of
parameters for each of the components used in CellML,
respectively. The goal is to integrate these markup languages
to provide a complete framework that would
accurately describe models ranging in scale from sub-cellular
reactions to an organ. Although this goal is ambitious, the
researchers at the University of Auckland consider it viable.[4]
V. THE ROAD AHEAD
A. CELLML’S AND SBML’S PAST HELPS THEIR FUTURE
If both CellML and SBML are to become comprehensive
model definition languages, their developers must share their
ideas for two main reasons. The first reason is that the other
language’s developers are a great resource for ideas and
improvements that could be implemented in one’s own
language, making the modelling framework capable of
handling more complex models. Thankfully, communication
between the SBML developers and CellML developers has
been very open and was the catalyst for SBML introducing the
use of metadata and MathML in its second version.[7] The
second reason is that, as the languages grow to describe more
complex models, it is extremely important that a model
described in one language can still be translated to the other
language.[7]
An example of this need arises because Virtual Cell uses
CellML as its model definition language and DARPA
BioSPICE uses SBML. If the Virtual Cell developers were
interested in a model that DARPA had developed, their
program must be able
to easily translate the SBML framework to the CellML
framework with minimal loss of data and model description.
This second reason is also the underlying reason model
definition languages are needed in the first place.
B. IMPROVEMENTS FOR BOTH LANGUAGES
Not only is it important for the developers to continue to share
existing functionality, it is also important that they share
ideas and knowledge regarding their future development
plans. This is important because, as each language’s
framework improves, the developers encounter
common problems. In the opinion of the author of this paper,
the most notable are the present inability to accurately and
easily model a stochastic biological process, and the need to
integrate the ontology of biochemical reactions presently
being defined by researchers all over the world.
The ontology problem is extremely important. Without the
use of a proper ontology, models that are filed away in
databases may not be found by a scientist looking for them
because he or she used a different annotation to search for a
particular element. This problem is eloquently presented in a
paper by Mark Ettinger, where he highlights how
complications can arise from the use of different variable
names and reaction sequences when describing the exact same
models.[8]
Although metadata is a start to overcoming this problem, it is
not the complete answer. To properly standardize a model’s
ontology, all parties (researchers, ontology markup language
developers, SBML developers, CellML developers and
database developers) must work together and all must agree
that the standard annotation can be implemented into their
respective work.
Producing a more comprehensive model definition
language is important to the developers’ goal of
creating a portable model framework, but they must also
ensure that the language remains easier to use than having a
researcher simply re-write an existing model in their own
computational tool. A good example of an application that
has developed into a large program capable of modelling a
very diverse range of systems, while still maintaining its
original simplicity, is MATLAB. MATLAB was originally
created to compute complex mathematical equations without
the user having to code in FORTRAN, therefore making the
user’s job easier. Today MATLAB has the ability to do
anything from modelling digital communication systems to
calculating a bottom line for a financial institution. This is
done through the use of toolboxes, which help MATLAB
maintain its user friendliness. Regardless of what MATLAB is
being used for, it is still considered to be very user friendly
relative to the alternatives. The author is not suggesting that
CellML and SBML should introduce toolboxes into their
frameworks, but instead stresses the importance of
maintaining each framework’s initial simplicity when adding
options that make the model language more comprehensive.
Another important area that SBML and CellML must address
to achieve their goal of portability is promoting the use of
their languages in the scientific community. So far they have
done this by giving presentations and distributing
documentation all over the world (on the web, at conferences,
etc.). They have also provided example models on their home
websites and ensured that their websites are informative.
Unfortunately, there still seem to be many researchers unsure
about exactly what benefits the CellML and SBML languages
offer, and how to code the specific model that they are
presently researching.[9] Therefore it is suggested that, in
addition to improving each language’s framework, additional
emphasis should be placed on the promotion of SBML and
CellML and their ability to solve current model portability
problems.
VI. CONCLUSION
SBML and CellML are designed to structure data in a way
that allows easy portability from one computational tool to
another. Although their respective architectures are different,
both languages perform the designated task to a certain
degree. Due to the different data structures, research groups
studying biological pathways would prefer to use SBML,
whereas CellML is usually applied to a more diverse range of
models. If both languages are to continue to improve the
portability of models, their developers must freely share
information with each other. This can lead to each language
implementing and improving on features developed by
the other, and to shared resources for developing
new features. These features should include the use of a
standard ontology. When the designers add any functional
changes to their respective language, they must ensure that
using the language remains much simpler than rewriting
an existing model’s code. A failure to maintain this simplicity
should be considered a failure in the design. In addition to
making functional improvements, the language creators must
also concentrate on promoting the use of their language and
the benefits the languages offer to the systems biology
research community.
REFERENCES
[1] M. Hucka et al., “The Systems Biology Markup Language (SBML): a
medium for representation and exchange of biochemical network
models” in Bioinformatics, vol. 19, no. 4, 2003, pp. 524-531
[2] W3C Architecture Domain, “Extensible Markup Language (XML)”,
August 2003, http://www.w3.org/XML/
[3] Finney et al., “Systems Biology Markup Language (SBML) Level 2:
Structures and Facilities for Model Definitions”, June 2003, [Online].
Available: http://aleron.dl.sourceforge.net/sourceforge/sbml/sbml-level2-v1.pdf
[4] University of Auckland, “CellML.org – What is CellML?”, July 2003,
http://www.cellml.org/public/about/what_is_cellml.html
[5] W. Hedley, “CellML Specification”, August 2001, [Online]. Available:
http://www.cellml.org/public/specification/20010810/cellml_specification.html
[6] W. Hedley, “SBML to CellML translation”, Presented June 2001,
[Online]. Available: http://www.sbwsbml.org/workshops/third/warren/200106_sysbio_caltech_sbml_cellml_translation.pdf
[7] W. Hedley, “Meeting Minutes 29 January 2001 – Meeting with the
SBML Team at Caltech”, January 2001, [Online]. Available:
http://www.cellml.org/private/progress_reports/20010129_meeting_minutes.html
[8] M. Ettinger, “The Complexity of Comparing Reaction Systems” in
Bioinformatics, vol. 18, no. 3, 2002, pp. 465-469
[9] Unknown – Discussion Board, “Portal – Summary-2002-07-01”,
[Online]. Available:
http://portal.bioengineering.elyt.ods.org/matt/cellml/VariousThoughts/Summary-2002-07-21/view