Systems biology reports
Modeling and Simulating Biological Networks
- Glycolysis Case Study
Hsiang-Yuan Yeh(葉向原), CHIH-YU CHEN(陳芝妤), Chien-Chih Tu(杜建志), WEI-CHIH LIN(林威志)
Department of Information Systems and Applications, National Tsing Hua University
101, Section 2, Kuang-Fu Road, Hsinchu 300, Taiwan, R.O.C
d926708@oz.nthu.edu.tw, g926741@oz.nthu.edu.tw, g937626@oz.nthu.edu.tw,
g936739@oz.nthu.edu.tw
Abstract
Developments in high-throughput measurement technologies for biology have created a paradigm shift in modern life science research. The field of systems biology provides a system-level understanding of functional genomics and pathways. We attempt to build a mathematical and chemical framework that uses conditions and biological knowledge to dynamically model and simulate biological networks. This project aims to help understand complex biological systems and to speed up drug discovery.
1. Introduction
The Human Genome Project is pushing scientists toward a new view of biology, in which many questions remain unsolved. Bioinformatics is the part of molecular biology that involves working with biological data, typically using computational processing. As bioinformatics matures, it speeds up drug discovery and biological research. Modern molecular biology and medical research involve large amounts of data, and the quantity of those data has grown exponentially. Much of the biological data reported in the literature has not been captured in databases. Unlike sequence data, many interactions are not found in on-line databases; they are reported in scientific journals in free-text format. Our previous research therefore focused on extracting molecular interactions from the biological literature, with reference to the technical reports.
Due to automated genomic and proteomic sequence analysis, the gigantic amount of biological data and knowledge produced poses great challenges not only to the biology field itself but also to the new information technology needed to assist biologists in processing, analyzing, and interpreting the information and knowledge. Many research results on genomic sequences and biological pathways have become available in electronic form via the Internet. In addition, NCBI has made its MEDLINE database accessible on the web; it currently contains over 14 million citations for biomedical abstracts dating back to the 1950s and grows by more than 40,000 abstracts each month. The MEDLINE abstracts provide a very rich biological knowledge base for biologists to search. However, retrieval of abstracts from MEDLINE still relies on keyword-based techniques. Biologists often have to read manually through the details of the retrieved texts in order to find the information they actually need.
1.1 Systems biology
We can extract simple interactions from the literature, but a challenge for the life sciences is to understand the relationships among the components of living systems. Genetic networks are more complex for biologists to analyze. Bioinformatics and genomics have identified many components that make up a living cell, and have helped us learn gene function and structure from sequence pattern recognition. In the next decade, research will focus on discovering the complex behavior that underlies development and disease. A major problem in this area is that the networks of cellular processes operate through complex interactions among a large number of genes, proteins, and other molecules. Systems biology studies biological processes as whole systems and focuses on gene-gene, protein-protein, and cell-cell interactions. It provides experimentally validated models to construct and predict the behavior of biological systems. For these reasons, we should merge systems biology into bioinformatics, using system-level views to understand dynamic processes.
Biotechnology projects are becoming larger and more complex. The systems biology community has cooperatively developed information standards for sharing models. One of them, the Systems Biology Markup Language (SBML), is a computer-readable format for representing models of biochemical reaction networks.
Figure 1 MAPK signaling pathway
1.2 Pathway
Genes are expressed at varying rates throughout the life of a cell, and the expression levels of different genes change continuously. Pathways are conceptual networks of molecular and cell biology that describe interactions and intra- and inter-cellular dynamics. They are key to understanding how an organism reacts to environmental or internal changes. Biologists study biological pathways (metabolic, regulatory, and signal transduction pathways) by searching databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), the Alliance for Cellular Signaling (AfCS), and the Signal Transduction Knowledge Environment (STKE). A pathway consists of a large number of modules linking receptors and the genome, and is involved in many diseases such as cancer, so it is hard to detect the overall set of biological interactions. As a result, pathway databases cannot be updated and maintained quickly. Figure 1 shows the MAPK signaling pathway in KEGG as a graph of nodes (genes, enzymes) connected by edges (interactions). The relationships between genes may be transcriptional regulation or other interactions.
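The graph view of a pathway described above can be sketched in a few lines of Python. This is only a minimal illustration and not KEGG's data model: the nodes form the well-known RAS, RAF, MEK, ERK core of the MAPK cascade, while the edge labels and the helper name `downstream` are hypothetical choices made for this example.

```python
from collections import deque

# Pathway as a directed graph: node -> list of (target, interaction type).
# Node selection and edge labels are illustrative assumptions.
pathway = {
    "RAS": [("RAF", "activation")],
    "RAF": [("MEK", "phosphorylation")],
    "MEK": [("ERK", "phosphorylation")],
    "ERK": [],
}

def downstream(graph, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for target, _interaction in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order

print(downstream(pathway, "RAF"))
```

Storing the interaction type on each edge keeps room for the distinction made in the text between transcriptional regulation and other kinds of interaction.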
2. Motivation and impact
When the number of retrieved texts is large, reading them becomes a laborious task. Many systems based on natural language processing techniques have therefore been proposed to extract knowledge of a specific domain automatically. Some of them extract information or relations directly from plain text using purely natural language processing techniques, and many use syntactic and semantic grammars to help extract knowledge.
Many biologists spend a lot of time on trial and error in experiments. This is not an efficient way to understand biological meaning, and it also wastes a lot of money. If we can combine biological knowledge with parameters to run simulations before the experiments, we may speed up biological research and even provide reliable results to biologists. This raises the following problems:
1) Where and how do we gather the biological knowledge for the experiments?
2) How do we run the simulation once some biological information has been gathered?
3. Proposed solution
3.1 Semantic web and ontology
An ontology is a formal description that explicitly captures abstract concepts and the relationships among domain-specific objects. It provides a common model of the relationships among the terms in a domain vocabulary for representing and sharing domain-specific knowledge. Ontologies are now used in many areas of research and application, including the bioinformatics community. We believe that combining syntactic and semantic grammar is important for extracting information efficiently. Some ontology development aims at building a conceptual hierarchy or relational network to classify a set of domain concepts or terminologies. Ontologies of this kind are usually known as thesauri; examples include WordNet, MeSH (Medical Subject Headings), and more recently GO (Gene Ontology). Other, more sophisticated ontology constructions aim at establishing schemes or classes that capture more elaborate relations among domain objects. To describe ontologies, the semantic web community has provided a set of standards known as OWL and RDF/RDFS, which are based on the XML format. We argue that using thesauri and ontologies together can significantly improve the performance of information extraction, and we show how the domain ontology and thesauri can be integrated into the information extraction process. In systems biology, biological knowledge lies behind the systems. The Gene Ontology describes molecular function, biological process, and cellular component. Figure 2 shows the structure of the Gene Ontology. MeSH has hierarchical categories for the components of biology. These ontologies are tools for the unification of biology, and they help with semantic annotation of the biological vocabulary and with function classification.
Figure 2 Gene ontology
3.2 Web service
There are many biological databases on the Internet, but it is hard to integrate the knowledge they hold. We use web service techniques to deal with this problem. Instead of being restricted to a single purpose, the agent-based system must be flexible and
extensible. By being flexible the system can
accept different goals from the biologist if
the goal is compatible with the capabilities
of the analyzing tools. By being extensible
the system can be enhanced with additional
tools and knowledge without need to change
the system architecture. The Web service
approach is introduced to satisfy these two
desirable properties. Information about how
to operate a tool and what the tool does
should be distributed to the description of
each service rather than hard coded within
the agent code. The agent must be able to
consult an external biological ontology,
written in OWL to understand the meanings
of the service descriptions, written in
OWL-S. The service descriptions and
ontology enable reasoning about the
capabilities of the tools. Thus this
architecture is extensible because new
processing tools can be wrapped as Web
services and added to the system, and
flexible because a Web service can be used
in various circumstances that change with
the processing goal.
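The idea of keeping tool descriptions outside the agent code can be sketched as follows. This is a simplified stand-in, not OWL-S: the class names `ServiceDescription` and `Matchmaker`, the concept names, and the exact-string matching are all assumptions made for illustration, whereas a real matchmaker would reason over the ontology.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceDescription:
    """Declarative description of a wrapped tool (stand-in for an OWL-S profile)."""
    name: str
    consumes: str   # input concept, e.g. "ProteinSequence"
    produces: str   # output concept, e.g. "HomologList"

class Matchmaker:
    """Holds service descriptions and matches a goal against them."""

    def __init__(self):
        self._services = []

    def register(self, service):
        # Extensible: adding a tool needs no change to the agent itself.
        self._services.append(service)

    def find(self, have, want):
        # Flexible: a service is selected whenever its description fits the goal.
        return [s.name for s in self._services
                if s.consumes == have and s.produces == want]

broker = Matchmaker()
broker.register(ServiceDescription("BLAST", "ProteinSequence", "HomologList"))
broker.register(ServiceDescription("InterProScan", "ProteinSequence", "DomainList"))
print(broker.find("ProteinSequence", "DomainList"))
```

Because the agent only queries descriptions, a new tool becomes usable as soon as it is registered.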
3.3 Biological pathway simulation
Many chemical reactions are involved in a biological pathway. Copasi is a software package for modeling biochemical systems. It simulates the kinetics of systems of biochemical reactions and provides a number of tools to fit models to data, optimize any function of the model, and perform metabolic control analysis and linear stability analysis. Copasi simplifies the task of model building by assisting the user in translating the language of chemistry (reactions) into mathematics (matrices and differential equations) in a transparent way. This is combined with a set of sophisticated numerical algorithms that ensure the results are obtained quickly and accurately.
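The translation from reactions to matrices and differential equations can be illustrated with a toy chain A -> B -> C. The sketch below is not Copasi's interface; it only shows the standard relation dx/dt = N v(x), where N is the stoichiometric matrix and v the vector of reaction rates. The mass-action rate laws and the rate constants k1 and k2 are made-up assumptions.

```python
# Species order for a toy chain A -> B -> C (hypothetical example).
species = ["A", "B", "C"]

# Stoichiometric matrix N: one row per species, one column per reaction.
# Column 0 is A -> B, column 1 is B -> C.
N = [
    [-1,  0],   # A is consumed by reaction 0
    [ 1, -1],   # B is produced by reaction 0 and consumed by reaction 1
    [ 0,  1],   # C is produced by reaction 1
]

def rates(x, k1=0.5, k2=0.2):
    """Mass-action rate vector v(x); k1 and k2 are made-up rate constants."""
    a, b, _c = x
    return [k1 * a, k2 * b]

def derivatives(x):
    """Right-hand side of the ODE system dx/dt = N . v(x)."""
    v = rates(x)
    return [sum(N[i][j] * v[j] for j in range(len(v))) for i in range(len(x))]

print(dict(zip(species, derivatives([2.0, 1.0, 0.0]))))
```

Since every column of N sums to zero in this closed chain, the derivatives also sum to zero, which is a quick check that total mass is conserved.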
4. System Architecture
To make biological research faster and more accurate, we propose to integrate databases, biological tools, and a pathway modeling and simulation toolkit. Figure 3 shows our global system architecture. Our system consists of the following modules:
[Figure 3: agent architecture diagram. Recoverable labels: Pathway modeling agent (model the pathway according to the promoter and molecular interactions); Literature extraction agent (extract the molecular interactions and chemical coefficients); Quantitative simulation agent (measure the chemical values by calculating the coefficients and pathway structure); Workflow planning agent (make the biological plan according to the Quality of Service and the user's goal); Bio-ontology & thesauri; Information wrapper agent (information gathering); Web service matchmaker (broker agent, connects the services); Databases (KEGG, NCBI, micro-array) and bioinformatics toolkit.]
Figure 3. Global Agent Architecture
1) Web service - information wrapper agent
There are many heterogeneous resources in biology, such as bioinformatics toolkits, gene databases, protein databases, and pathway databases. We develop an agent environment that accesses and filters this information automatically using web service techniques. Biologists then do not need to open every web page or know how each program works.
2) Literature extraction agent
A relational database system for managing kinetic data, chemical structures, pathways, and chemical reactions provides stoichiometric information and parameters for the kinetic equations to the model. Concentrations, reaction rates, etc. are reported in previous works; we will extract this information automatically using the literature extraction agent.
3) Pathway modeling agent
KEGG provides many pathways to biologists, but many molecular interactions in a pathway are still not covered. Micro-arrays are a high-throughput tool for measuring gene expression, and the biological literature also contains many interactions and parameters. We will gather this information and try to model the pathway from these resources.
4) Workflow planning agent
We integrate many kinds of bioinformatics toolkits to carry out complex tasks. When the user submits a query, the agent finds the proper tools and constructs a workflow to reach the goal. For example, gene annotation of an unknown sequence is a complex task: the workflow should find the open reading frame, compare it with other proteins using BLAST, and check the protein domains with other tools such as InterProScan, Pfam, etc.
5) Quantitative simulating agent
The quantitative simulating agent stores many formulas for chemical reactions and has the knowledge to decide which formula to use under different conditions. It calculates the coefficient of each compound in the biological pathway and some constraints on the chemical reactions.
6) Shared Biological Ontology
The ontology represents the concepts of pathways and reactions, so that the system knows every compound and enzyme in those pathways. These are not just strings; they carry semantic meaning for the machine.
7) Experiments and Methods
Figure 4 shows the workflow of biological pathway modeling and simulation. We discuss the workflow below.
[Figure 4: workflow diagram. Recoverable labels: Kinetics database (get the kinetics coefficients from the experiments or literature); Biological database (get the gene name, chemical compound, and its physical information); Chemical database (get the chemical reactions); Pathway database (get the biological pathway); Dynamic model; Stoichiometric model.]
Figure 4 Workflow description
6.1 KEGG web service
KEGG API provides valuable means
for accessing the KEGG system, such as for
searching and computing biochemical
pathways in cellular processes or analyzing
the universe of genes in the completely
sequenced genomes. Users can access
the KEGG API server using SOAP
over HTTP. The
SOAP server also comes with a WSDL description,
which makes it easy to build a client library
for a specific programming language. The web
service functions are shown in Figures 5 and 6.
Figure 5 KEGG web service query
Figure 6 KEGG web service results
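The SOAP interface described here was later retired by KEGG in favor of a REST-style interface. As a hedged sketch of what a present-day client might look like, the snippet below only builds request URLs and parses a tab-separated listing. The base URL, the `list` and `get` operations, and the hand-written sample line reflect our understanding of the public KEGG REST service and should be checked against its current documentation; the helper names are hypothetical, and no network call is made.

```python
KEGG_REST = "https://rest.kegg.jp"  # assumed base URL of the KEGG REST service

def list_pathways_url(organism):
    """URL that lists the pathways of one organism code (e.g. 'sce' for yeast)."""
    return f"{KEGG_REST}/list/pathway/{organism}"

def get_entry_url(entry_id, option=None):
    """URL that retrieves one entry, optionally in a format such as 'kgml'."""
    url = f"{KEGG_REST}/get/{entry_id}"
    return f"{url}/{option}" if option else url

def parse_listing(text):
    """Parse 'identifier<TAB>description' lines into a dictionary."""
    entries = {}
    for line in text.splitlines():
        if "\t" in line:
            key, description = line.split("\t", 1)
            entries[key] = description
    return entries

# Sample response line written by hand for illustration, not fetched from KEGG.
sample = "sce00010\tGlycolysis / Gluconeogenesis - Saccharomyces cerevisiae"
print(get_entry_url("sce00010", "kgml"))
print(parse_listing(sample))
```

Separating URL construction from parsing keeps the sketch testable offline; the actual HTTP request can be added with any client library.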
6.2 Ontology-based knowledge extraction
Extracting knowledge directly from natural language plain text is not a trivial task. The difficulty is that the relations or knowledge to be extracted can be embedded deep in the sentences and cannot easily be extracted by simple keyword matching without sophisticated inference. Extracting knowledge exactly can be as hard as deep understanding of a natural language text. To reduce the difficulty of knowledge extraction from natural language texts for intelligent software agents, we perform automatic semantic annotation before the actual knowledge extraction process. We use the sentence "The pyruvate concentration that is required to accommodate a flux of 0.48 C-mol/min*L-cytosol, is 8 mM." as an example.
Step 1. Automatic semantic annotation using thesauri
After annotation, we get the following result:
The<SW.N> pyruvate<ME.D> concentration<ME.F> that<WN.AD> is<WN.V> required<WN.V> to<WN.AD> accommodate<WN.V> a<SW.N> flux<WN.N> of<SW.P> 0.48<U> C-mol<SI>/<SI>min<SI>*<SI>L-cytosol<SI>, is<WN.V> 8<U> mM<SI>.<PU>
The semantic codes <ME.D>, <ME.C>, and <ME.Q> indicate Descriptors, Supplementary Concepts, and Qualifiers in MeSH respectively; <SW.A>, <SW.P>, <SW.C>, and <SW.N> indicate Articles, Prepositions, Conjunctions, and Nouns in the stop-word list respectively; <WN.A>, <WN.AD>, <WN.V>, and <WN.N> indicate Adjective, Adverb, Verb, and Noun in WordNet respectively; <U> indicates a number; <SI> indicates a metric system unit; <PU> indicates a punctuation mark; <ABBR> indicates an abbreviation; and <GO.FUN>, <GO.PRO>, and <GO.COM> indicate Molecular Function, Biological Process, and Cellular Component in the Gene Ontology respectively.
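A thesaurus lookup of this kind can be sketched as a dictionary-based tagger. The snippet is a minimal illustration, not the system described in this report: the lexicon holds only a handful of entries copied from the example above, the tokenizer is a simple regular expression, and the function name `annotate` is hypothetical.

```python
import re

# Tiny hypothetical lexicon; a real system would load MeSH, WordNet, etc.
LEXICON = {
    "pyruvate": "ME.D",
    "flux": "WN.N",
    "is": "WN.V",
    "of": "SW.P",
    "mM": "SI",
}

NUMBER = r"\d+(?:\.\d+)?"

def annotate(sentence):
    """Attach a semantic code to each token that the lexicon or rules recognize."""
    tagged = []
    for token in re.findall(NUMBER + r"|[A-Za-z-]+|[^\sA-Za-z\d]", sentence):
        if re.fullmatch(NUMBER, token):
            tag = "U"                      # number
        elif token in LEXICON:
            tag = LEXICON[token]           # thesaurus hit
        elif re.fullmatch(r"[^\w\s]", token):
            tag = "PU"                     # punctuation mark
        else:
            tag = None                     # unknown word: left untagged
        tagged.append(f"{token}<{tag}>" if tag else token)
    return " ".join(tagged)

print(annotate("pyruvate flux is 8 mM."))
```

Unknown words are left untagged rather than guessed, so that later pattern rules can decide how to treat them.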
Step 2. The pattern grammar rules
We have pattern rules that combine single words into phrases. The example then receives semantic tags like this:
The pyruvate<ME.D> concentration<ME.F> that<WN.AD> is required to<WN.V> accommodate<WN.V> a flux<WN.N> of<SW.P> 0.48<U> C-mol/min*L-cytosol<SI>, is<WN.V> 8<U> mM<SI>.<PU>
Step 3. Map the syntactic grammar to the semantic structure in the domain ontology
After pattern matching, we substitute the proper noun phrase patterns with special variable symbols ("x's") to simplify the subsequent parsing. The substituted sentence is given to the Minipar parser, which yields a dependency tree as shown in Figure 7 and extracts the subject and object at the syntactic level. The parsing results are then used to extract the relationships in the sentence in the form of RDF triples <Subject, Predicate, Object>. We combine the syntactic and semantic relationships and map them to the semantic template defined in the domain ontology. From the example we can extract the triples (The pyruvate concentration, is, 8 mM) and (The pyruvate flux, is, 0.48 C-mol/min*L-cytosol).
(
E2  (()  U  *  )
E0  (()  fin C  E2  )
1   (The  ~ Det  3  det  (gov concentration))
2   (pyruvate  ~ N  3  nn  (gov concentration))
3   (concentration  ~ N  5  s  (gov require))
4   (is  be be  5  be  (gov require))
5   (required  require V  E0  i  (gov fin))
6   (to  ~ Aux  7  aux  (gov accommodate))
7   (accommodate  ~ V  5  sc  (gov require))
E3  (()  concentration N  7  subj  (gov accommodate)  (antecedent 3))
8   (a  ~ Det  9  det  (gov flux))
9   (flux  ~ N  7  obj  (gov accommodate))
10  (of  ~ Prep  9  mod  (gov flux))
11  (0.48  ~ N  16  num  (gov C-mol/min*L-cytosol))
12  (C  ~ U  16  lex-mod  (gov C-mol/min*L-cytosol))
13  (~U  16  lex-mod  (gov C-mol/min*L-cytosol))
14  (mol/min*L  ~ U  16  lex-mod  (gov C-mol/min*L-cytosol))
15  (~U  16  lex-mod  (gov C-mol/min*L-cytosol))
16  (cytosol  C-mol/min*L-cytosol N  10  pcomp-n  (gov of))
19  (8  ~ U  20  lex-mod  (gov "8 mM"))
20  (mM  "8 mM" N  E2  )
Figure 7 Minipar parsing results
Step 4. Convert the plain text into OWL format
In this step, the system assigns semantic categories from the domain ontology to the phrase terms in the triples extracted by the parser. Because conducting semantic annotation takes a lot of time, the system converts each annotated sentence into OWL instance format.
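As a sketch of what such a conversion might produce, the snippet below serializes one extracted triple as an RDF/XML individual, which is a common way to write OWL instances. The ontology namespace, the class name `Concentration`, the property name `hasValue`, and the function name are all hypothetical; the report does not specify the actual ontology vocabulary.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BIO_NS = "http://example.org/bio-ontology#"   # hypothetical ontology namespace

def triple_to_rdf(subject_id, ontology_class, prop, value):
    """Serialize one extracted fact as an RDF/XML individual of an ontology class."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("bio", BIO_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # A typed node element such as <bio:Concentration rdf:ID="..."> declares an instance.
    individual = ET.SubElement(root, f"{{{BIO_NS}}}{ontology_class}",
                               {f"{{{RDF_NS}}}ID": subject_id})
    ET.SubElement(individual, f"{{{BIO_NS}}}{prop}").text = value
    return ET.tostring(root, encoding="unicode")

# The triple (The pyruvate concentration, is, 8 mM) from the example sentence.
print(triple_to_rdf("pyruvate_concentration", "Concentration", "hasValue", "8 mM"))
```

Assigning the subject to an ontology class is what turns a purely syntactic triple into an instance that a reasoner can query.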
6.3 Modeling and Simulating Biological
pathway – Glycolysis
In this project, we used the biochemical network simulation software Copasi to simulate the process of glycolysis shown in Figure 8. Because E. coli lacks the enzyme EC 4.1.1.1, we used data from yeast for the simulation in order to compute how the concentration of ethanol changes. Copasi is the successor of an older and commonly used biochemical simulator called Gepasi. Figure 10 shows the Copasi interface and the configuration options for the network. All the reactions involved in our selected pathway are entered under Reactions.
Figure 8 Glycolysis
Figure 9 Differential equation of glycolysis
Figure 10 Reaction Inputs Overview
By clicking on the desired reaction on the left, we can examine the details of that reaction. Figure 11 shows the details of one of the reactions. The enzyme that catalyzes the selected reaction was chosen as the reaction name. In this particular reaction, P2G is the only substrate and PEP is the only product. The parameters listed are V, Ka for the substrate, Kp for the product, and Keq. The values are predetermined so that the rate equation can be used.
In the equation, a and p represent the concentrations of the substrate and the product, respectively. Γ is the mass-action ratio, p/a, and Ka and Kp are the Michaelis-Menten constants for a and p. The equation is used specifically for reversible reactions with only one substrate and one product. The rate equation is entered under "Functions" in the directory on the left with the name function_4_vENO_2, and that function is chosen for the kinetics of this reaction. By double-clicking on "Functions" and then on function_4_vENO_2, we get Fig.12, which contains the formula used to calculate the rate v and the data type of each parameter.
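The one-substrate rate equation itself is not reproduced in this text. Assuming it is the standard reversible Michaelis-Menten form used in the yeast glycolysis model of Teusink et al. (see Reference), it reads v = V (a/Ka) (1 - Γ/Keq) / (1 + a/Ka + p/Kp) with Γ = p/a. The following sketch implements that assumed form; the function name and the parameter values are made up for illustration.

```python
def reversible_mm_rate(a, p, V, Ka, Kp, Keq):
    """Reversible Michaelis-Menten rate for one substrate a and one product p.

    Algebraically equal to V*(a/Ka)*(1 - Gamma/Keq) / (1 + a/Ka + p/Kp)
    with Gamma = p/a, but written so that a = 0 does not divide by zero.
    """
    numerator = V * (a / Ka - p / (Ka * Keq))
    denominator = 1.0 + a / Ka + p / Kp
    return numerator / denominator

# Made-up parameter values, chosen only to exercise the function.
V, Ka, Kp, Keq = 10.0, 0.5, 0.5, 4.0
forward = reversible_mm_rate(a=0.5, p=0.0, V=V, Ka=Ka, Kp=Kp, Keq=Keq)
balanced = reversible_mm_rate(a=0.5, p=2.0, V=V, Ka=Ka, Kp=Kp, Keq=Keq)
print(forward, balanced)
```

The second call uses p/a = Keq, so the net rate vanishes; product concentrations above that ratio give a negative rate, i.e. the reaction runs backwards.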
Figure 11 Reaction Input details of the
reaction P2G ←→PEP using enzyme ENO
Fig.12 Details of “function_4_vENO_2” for
P2G ←→PEP using enzyme ENO
For reversible Michaelis-Menten kinetics with two non-competing substrates and products, another equation is used that takes both substrates and both products into account:
In this case, a and b represent the two substrates, and p and q represent the two products. This kinetic equation is used for the enzymes HK, GAPDH, PFK and PYK.
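The two-substrate equation is likewise missing from this text. A common form for two non-competing substrate-product couples, which we assume here, is v = V (ab/(Ka Kb)) (1 - Γ/Keq) / ((1 + a/Ka + p/Kp)(1 + b/Kb + q/Kq)) with Γ = pq/(ab). The pairing of a with p and b with q, the function name, and all numerical values are assumptions made for this sketch.

```python
def reversible_mm_rate_2x2(a, b, p, q, V, Ka, Kb, Kp, Kq, Keq):
    """Reversible rate law for substrates a, b and products p, q (assumed form).

    The numerator equals V*(a*b/(Ka*Kb))*(1 - Gamma/Keq) with Gamma = p*q/(a*b),
    rearranged to avoid dividing by a*b.
    """
    numerator = V * (a * b - p * q / Keq) / (Ka * Kb)
    denominator = (1.0 + a / Ka + p / Kp) * (1.0 + b / Kb + q / Kq)
    return numerator / denominator

# Made-up values: all Michaelis constants 1.0, V = 10.0, Keq = 4.0.
params = dict(V=10.0, Ka=1.0, Kb=1.0, Kp=1.0, Kq=1.0, Keq=4.0)
initial = reversible_mm_rate_2x2(a=1.0, b=1.0, p=0.0, q=0.0, **params)
balanced = reversible_mm_rate_2x2(a=1.0, b=1.0, p=2.0, q=2.0, **params)
print(initial, balanced)
```

As in the one-substrate case, the rate is zero when the mass-action ratio equals Keq, which is a useful sanity check when entering such a function into a simulator.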
Fig.13. Details of “function_4_vENO_2” for
P2G ←→PEP using enzyme ENO.
Only one compartment is needed in our simulation, so all of the metabolites are placed in uVol, as shown in Figure 13. Again, we double-click on the desired metabolite to examine its details. Initial concentrations can be set directly in the right window in Figure 13 by selecting the cell and typing in the desired value. We will create plots and reports of the concentration changes over time. Thus, by changing the initial concentration of any desired metabolite, the software can calculate a prediction of the ETOH yield. In this way we can understand how the concentration of ETOH may be affected by increasing the concentrations of different metabolites.
Now that all the inputs are set, we can start to define reports and plots. In Figure 14, we set up the items to be reported by clicking on the plus sign, and change their order by selecting an item and clicking the up and down arrows. We can also set up plots as an output by adding a default plot, in which the concentrations of the metabolites are the default output. In Figure 15, unnecessary metabolites can be removed by clicking "delete curve".
Fig.14. Report Definition
Fig.15. Plot configuration
Once the output settings are done, the last step is to obtain the result. First we click on ReportDefinition to set the directory where we want to save our result, and the window in Figure 17 pops up. Once the target is chosen, we click on Confirm and then click "Run".
Fig.16. Time Course Result
Fig.17. Report Definition Selection
7. Results and Conclusion
The results shown in Figures 18 and 19 are the outputs of the plot and the report, respectively. Although they are different forms of output, they contain essentially the same information. In the concentration plot, the lines we do not want to display can be deselected by clicking on the metabolite at the bottom. The purple line represents the concentration of glucose transported inside, whereas the red and green lines show the concentrations of ETOH and CO2, respectively.
Fig.18. Result: Concentration Plot
Fig.19. Result: Report saved in a chosen file
To show how [ethanol] changes with [AcAld] as an example, we increase the initial concentration of AcAld by 20M. In Figure 20, the concentration of ethanol is predicted to increase by roughly 10M; note the sigmoidal curvature of the red ETOH line at the beginning, which is due to the drastic change in the concentration of AcAld. Thus, by using Copasi, we can estimate how the concentration of ETOH changes with other concentrations over time.
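The kind of "what if" experiment described above can be mimicked with a deliberately small model. The sketch below is not the Copasi glycolysis model: it integrates a single toy conversion AcAld -> EtOH with an irreversible Michaelis-Menten rate using Euler steps, and every parameter value and the function name `simulate` are made up. It only illustrates how changing an initial concentration changes the predicted time course.

```python
def simulate(acald0, etoh0=0.0, vmax=1.0, km=0.5, dt=0.01, t_end=50.0):
    """Euler integration of the toy conversion AcAld -> EtOH.

    Uses an irreversible Michaelis-Menten rate; all parameter values are
    illustrative and unrelated to the model discussed in the text.
    """
    acald, etoh = acald0, etoh0
    for _ in range(int(round(t_end / dt))):
        rate = vmax * acald / (km + acald)
        step = min(rate * dt, acald)      # never consume more than is present
        acald -= step
        etoh += step
    return acald, etoh

acald_low, etoh_low = simulate(acald0=1.0)
acald_high, etoh_high = simulate(acald0=2.0)
print(etoh_low, etoh_high)
```

In this closed toy system the extra acetaldehyde ends up entirely as ethanol; in a full pathway model the increase is shared with competing reactions, which is why a simulator is needed to predict the actual yield.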
Fig.20. The concentration of the ethanol
production (red)
We also consider ethanol production from the viewpoint of biosynthesis in E. coli. Ethanol has long been produced by microorganisms and has many kinds of uses in the field; for example, cellulose is used as the raw material and ethanol is the product. Efficiency is the most important factor that scientists consider. Based on the references, we suppose that biosynthesis of ethanol using E. coli is a good approach. With it, some processes can be left out, making production much more efficient than the traditional routes, which are either synthesis from ethylene or fermentation of grain, cellulose, or sugar.
The key to ethanol production by E. coli is pyruvate, which is an important intermediate in the metabolic pathway shown in Figure 21. Compared with the common metabolic pathway, E. coli lacks two enzymes needed to convert pyruvate to ethanol. In the metabolism from pyruvate to ethanol, acetaldehyde is the intermediate.
Figure 21 Pyruvate metabolism
Figure 22 shows that the enzymes which E. coli lacks are pyruvate decarboxylase and alcohol dehydrogenase. According to the metabolic pathway and gene databases EcoCyc (Encyclopedia of Escherichia coli K12 Genes and Metabolism) and NCBI, E. coli strain K12 has an alcohol dehydrogenase (adhP) for step 2. Besides strain K12, E. coli strain KO11 is an ethanol-producing recombinant in which the genes for ethanol production were cloned from Z. mobilis (pdc, adhB). In addition, pyruvate decarboxylase from Saccharomyces cerevisiae (pdc1) was successfully isolated and fused to the E. coli indicator gene LacZ, and with the T7 RNA polymerase promoter phi 10 it can be expressed in E. coli.
Figure 22 Ethanol production from pyruvate
Below are two ways to carry out biosynthesis of ethanol by E. coli:
1) Escherichia coli strain KO11
An ethanol-producing recombinant with Z. mobilis genes: pyruvate decarboxylase (pdc) and alcohol dehydrogenase (adhB).
2) Escherichia coli strain K12 with pdc1 cloning
This strain has alcohol dehydrogenase (adhP) but lacks pyruvate decarboxylase, so we can clone Saccharomyces cerevisiae pyruvate decarboxylase (pdc1), and even the T7 RNA polymerase promoter phi 10, into it.
From glycolysis to pyruvate and then from pyruvate to ethanol, we consider the partial processes in metabolism that directly or indirectly affect the yield, and control the substrates or products by inhibiting one or more reactions to obtain the highest ethanol production by E. coli.
8. Future work
Future work is the combination of flux-based static modeling with dynamic modeling based on kinetic equations. The model can be initiated as a stoichiometric model that is gradually converted into a dynamic model by adding dynamic equations. Flux distribution analysis serves as a method for calculating each flux in stoichiometric models. Substances at the boundary between the dynamic model and the stoichiometric model are influenced by both kinds of fluxes.
Biological interactions in the life sciences are increasingly complex, and new kinds of cancers and viruses emerge quickly. Biologists alone are not enough to tackle these problems, so we should bring in computational biology to deal with the large amount of data. With standards and a common platform, life science researchers around the world will be able to share and cooperate. This would speed up drug discovery and drug design. With further expansion into clinical diagnostics and therapeutics (disease classification, prediction, treatment), it may in the future become more popular for personal health care.
Reference
Teusink, B.; Passarge, J.; Reijenga, C.A.; Esgalhado, E.; et al. Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. Eur. J. Biochem. 2000, 267, 5313-5329
Karp, P.D.; Riley, M.; Saier, M.; Paulsen, I.T.; Collado-Vides, J.; Paley, S.; Pellegrini-Toole, A.; Bonavides, C.; Gama-Castro, S. The EcoCyc database. Nucleic Acids Res. 2002, 30, 56-58
Yomano, L.P.; York, S.W.; Ingram, L.O. Isolation and characterization of ethanol-tolerant mutants of Escherichia coli KO11 for fuel ethanol production. Journal of Industrial Microbiology & Biotechnology 1998, 20, 132-138
Genome research environment - GenRE, http://mips.gsf.de/genre/proj/genre