Systems biology reports

Modeling and Simulating Biological Networks - Glycolysis Case Study

Hsiang-Yuan Yeh (葉向原), CHIH-YU CHEN (陳芝妤), Chien-Chih Tu (杜建志), WEI-CHIH LIN (林威志)
Department of Information Systems and Applications, National Tsing Hua University
101, Section 2, Kuang-Fu Road, Hsinchu 300, Taiwan, R.O.C.
d926708@oz.nthu.edu.tw, g926741@oz.nthu.edu.tw, g937626@oz.nthu.edu.tw, g936739@oz.nthu.edu.tw

Abstract

Developments in high-throughput measurement technologies for biology have created a paradigm shift in modern life science research. The field of systems biology provides a system-level understanding of functional genomics and pathways. We attempt to build a mathematical and chemical framework that uses stated conditions and biological knowledge to dynamically model and simulate a biological network. This project aims to help understand complex biological systems and to speed up drug discovery.

1. Introduction

The most important outcome of the human genome project is that it pushes scientists toward a new view of biology. Many questions in biology remain unsolved. Bioinformatics is the part of molecular biology that involves working with biological data, typically using computational processing. As bioinformatics matures, it speeds up drug discovery and biological research. Modern molecular biology and medical research involve large amounts of data, and the quantity of data has grown exponentially. Much of the biological data reported in the literature has not been captured in databases. Unlike sequence data, which can be found in on-line databases, many interactions are reported in scientific journals in free-text format. Our previous research therefore focused on extracting molecular interactions from the biological literature (refer to the technical reports).

Due to automated genomic and proteomic sequence analysis, the gigantic amount of biological data and knowledge produced poses great challenges not only to the biology field itself but also to the new information technology needed to assist biologists in processing, analyzing, and interpreting the information and knowledge. Many research results on genomic sequences and biological pathways have become available in electronic form via the Internet or the Web. In addition, NCBI has made its MEDLINE database accessible on the web; it currently contains over 14 million citations for biomedical abstracts dating back to the 1950s and grows by more than 40,000 abstracts each month. The abstracts of MEDLINE provide a very rich biological knowledge base for biologists to search. However, retrieval of abstracts from MEDLINE still relies on keyword-based techniques. Biologists often have to read manually through the retrieved texts to find the information they actually need.

1.1 Systems biology

We can extract simple interactions from the literature, but a challenge for the life sciences is to understand the relationships among the components of living systems. Genetic networks are more complex for biologists to analyze. Bioinformatics and genomics have identified many components that make up a living cell and have helped us learn gene function and structure through sequence pattern recognition. In the next decade, research will turn to discovering the complex behavior that underlies development and disease. A major problem in this area is that cellular processes operate through networks of complex interactions among a large number of genes, proteins, and other molecules. Systems biology studies biological processes as whole systems and focuses on gene-gene, protein-protein, and cell-cell interactions. It provides experimentally validated models for constructing and predicting the behavior of biological systems. For these reasons, systems biology should be merged into bioinformatics, using system-level views to understand dynamic processes. Biotechnology is becoming larger and more complex.
The systems biology community has developed information standards so that models can be shared and developed cooperatively. The Systems Biology Markup Language (SBML) is a computer-readable format for representing models of biochemical reaction networks.

Figure 1 MAPK signaling pathway

1.2 Pathway

Genes are expressed at varying rates throughout the life of a cell, and the expression levels of different genes vary continuously. Pathways are conceptual networks in molecular and cell biology that describe interactions and intra- and inter-cellular dynamics. They are key to understanding how an organism reacts to environmental or internal changes. Biologists study biological pathways (metabolic, regulatory, and signal transduction pathways) by searching databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), the Alliance for Cellular Signaling (AfCS), and the Signal Transduction Knowledge Environment (STKE). A pathway consists of a large number of modules linking receptors and the genome and is involved in many diseases such as cancer, so it is hard to detect the overall set of biological interactions. Consequently, pathway databases cannot be updated and maintained quickly. Figure 1 shows the MAPK signaling pathway in KEGG as a graph of nodes (genes, enzymes) connected by edges (interactions). The relationships between genes may be transcriptional regulation or other interactions.

2. Motivation and impact

When the amount of retrieved text is large, reading it becomes a laborious task. Many approaches based on natural language processing have therefore been proposed to extract knowledge of a specific domain automatically. Some extract information or relations directly from plain text using purely natural language processing techniques, and many systems use syntactic and semantic grammars to help extract knowledge. Many biologists spend a lot of time on trial and error in experiments.
Trial and error is not an efficient way to understand biological meaning, and it also wastes a lot of money. If we can combine biological knowledge with parameters to run simulations before the experiments, we may speed up biological research and even provide reliable supporting results for biologists. This leaves two problems to solve:

1) Where and how do we gather the biological knowledge for the experiments?
2) How do we run the simulation once we have some biological information?

3. Proposed solution

3.1 Semantic web and ontology

An ontology is a formal specification that explicitly describes abstract concepts and the relationships among domain-specific objects. It provides a common model of the relationships among the terms in a domain vocabulary for representing and sharing domain-specific knowledge. Ontologies are now used in many areas of research and application, including the bioinformatics community. We believe that combining syntactic and semantic grammars is important for extracting information efficiently. Some ontology development aims at building a conceptual hierarchy or relational network to classify a set of domain concepts or terms; such ontologies are usually known as thesauri. WordNet, MeSH (Medical Subject Headings), and more recently GO (Gene Ontology) are examples. Other, more sophisticated ontology constructions aim at establishing schemes or classes that capture more elaborate relations among domain objects. To describe ontologies, the semantic web community has provided a set of standards, OWL and RDF/RDFS, that are based on XML. We argue that using thesauri and ontologies together can significantly improve the performance of information extraction, and we show how the domain ontology and thesauri can be integrated into the information extraction process. In systems biology, there is biological knowledge behind the systems.
The Gene Ontology describes molecular function, biological process, and cellular component. Figure 2 shows the structure of the Gene Ontology. MeSH has hierarchical categories for the components of biology. These ontologies are tools for the unification of biology and help people perform semantic annotation of biological vocabulary and functional classification.

Figure 2 Gene ontology

3.2 Web service

There are many biological databases on the Internet, but it is hard to integrate the knowledge from them. We use web service techniques to deal with this problem. Rather than being restricted to a single purpose, the agent-based system must be flexible and extensible. Being flexible means the system can accept different goals from the biologist, provided the goal is compatible with the capabilities of the analysis tools. Being extensible means the system can be enhanced with additional tools and knowledge without changing the system architecture. The web service approach is introduced to satisfy these two properties. Information about how to operate a tool and what the tool does should be placed in the description of each service rather than hard-coded in the agent. The agent must be able to consult an external biological ontology, written in OWL, to understand the meaning of the service descriptions, written in OWL-S. The service descriptions and the ontology enable reasoning about the capabilities of the tools. This architecture is thus extensible, because new processing tools can be wrapped as web services and added to the system, and flexible, because a web service can be used in various circumstances that change with the processing goal.

3.3 Biological pathway simulation

Many chemical reactions are involved in a biological pathway.
Copasi is a software package for modeling biochemical systems. It simulates the kinetics of systems of biochemical reactions and provides a number of tools to fit models to data, optimize any function of the model, and perform metabolic control analysis and linear stability analysis. Copasi simplifies model building by assisting the user in translating the language of chemistry (reactions) into mathematics (matrices and differential equations) in a transparent way. This is combined with a set of sophisticated numerical algorithms that ensure results are obtained quickly and accurately.

4. System Architecture

To make biological research faster and more accurate, we propose to integrate databases, biological tools, and a pathway modeling and simulation toolkit. Figure 3 shows our global system architecture.

Figure 3. Global Agent Architecture. Components shown:
Pathway modeling agent: models the pathway according to the promoter and molecular interactions.
Literature extraction agent: extracts the molecular interactions and chemical coefficients.
Quantitative simulation agent: measures the chemical values by calculating the coefficients and pathway structure.
Workflow planning agent: makes the biological plan according to the Quality of Service and the user's goal.
Information wrapper agent: information gathering.
Web service matchmaker (broker agent): connects the services.
Bio-ontology and thesauri.
Databases (KEGG, NCBI, micro-array) and bioinformatics toolkit.

Our system consists of the following modules:

1) Web service - information wrapper agent
There are many heterogeneous resources for biology, such as bioinformatics toolkits, gene databases, protein databases, and pathway databases. We develop an agent environment that accesses and filters this information automatically using web service techniques. Biologists then do not need to open every web page or know how each program works.
2) Literature extraction agent
Relational database systems that manage kinetic data, chemical structures, pathways, and chemical reactions provide the model with stoichiometric information and parameters for the kinetic equations. Concentrations, reaction rates, and similar values are reported in previous work. We will extract this information automatically using the literature extraction agent.

3) Pathway modeling agent
KEGG provides many pathways to biologists, but many molecular interactions in a pathway are still not covered. Micro-arrays are a high-throughput tool for measuring gene expression. There are also many interactions and parameters in the biological literature. We will gather this information and try to model the pathway from many resources.

4) Workflow planning agent
We integrate many kinds of bioinformatics toolkits to carry out complex tasks. When the user submits a query, the agent finds the proper tools and constructs a workflow to reach the goal. For example, annotating an unknown sequence is a complex task: it requires finding the open reading frame, comparing it with other proteins using BLAST, and checking the protein domains with other tools such as InterProScan and Pfam.

5) Quantitative simulating agent
The quantitative simulating agent stores many formulas for chemical reactions and has the knowledge to decide which formula applies under which conditions. It calculates the coefficient of each compound in the biological pathway, subject to constraints on the chemical reactions.

6) Shared Biological Ontology
The ontology represents the concepts of pathways and reactions, so the system knows every compound and enzyme in those pathways. These are not merely strings; they carry semantic meaning for the machine.

7) Experiments and Methods
Figure 4 shows the workflow of biological pathway modeling and simulation. We will discuss the workflow later.
Figure 4 Workflow description (kinetics database: get the kinetic coefficients from experiments or the literature; biological database: get the gene name, chemical compound, and its physical information; chemical database: get the chemical reactions; pathway database: get the biological pathway; resulting models: dynamic model and stoichiometric model)

6.1 KEGG web service

The KEGG API provides a means of accessing the KEGG system, for example for searching and computing biochemical pathways in cellular processes or analyzing the universe of genes in completely sequenced genomes. Users access the KEGG API server using SOAP over HTTP. The SOAP server also comes with a WSDL description, which makes it easy to build a client library for a specific programming language. The web service functions are shown in Figures 5 and 6.

Figure 5 KEGG web service query
Figure 6 KEGG web service results

6.2 Ontology-based knowledge extraction

Extracting knowledge directly from natural language plain text is not a trivial task. The difficulty is that the relations or knowledge to be extracted can be embedded deep in sentences and cannot easily be extracted by simple keyword matching without sophisticated inference. Extracting knowledge exactly can be as hard as deep understanding of a natural language text. To reduce this difficulty for intelligent software agents, we perform automatic semantic annotation before the actual knowledge extraction process. We use the sentence "The pyruvate concentration that is required to accommodate a flux of 0.48 C-mol/min*L-cytosol, is 8 mM." as an example.

Step 1.
Automatic semantic annotation using thesauri

After annotation, we get the following result:

The<SW.N> pyruvate<ME.D> concentration<ME.F> that<WN.AD> is<WN.V> required<WN.V> to<WN.AD> accommodate<WN.V> a<SW.N> flux<WN.N> of<SW.P> 0.48<U> C-mol<SI>/<SI>min<SI>*<SI>L-cytosol<SI>, is<WN.V> 8<U> mM<SI>.<PU>

The semantic codes <ME.D>, <ME.C>, and <ME.Q> indicate Descriptors, Supplementary Concepts, and Qualifiers in MeSH, respectively; <SW.A>, <SW.P>, <SW.C>, and <SW.N> indicate Articles, Prepositions, Conjunctions, and Nouns in the stop-word list, respectively; <WN.A>, <WN.AD>, <WN.V>, and <WN.N> indicate Adjective, Adverb, Verb, and Noun in WordNet, respectively; <U> indicates a Number; <SI> indicates a Metric System Unit; <PU> indicates a Punctuation Mark; <ABBR> indicates an Abbreviation; and <GO.FUN>, <GO.PRO>, and <GO.COM> indicate Molecular Function, Biological Process, and Cellular Component in GeneOntology, respectively.

Step 2. The pattern grammar rules

We apply pattern rules to combine single words into phrases. The example then receives the following semantic tags:

The pyruvate<ME.D> concentration<ME.F> that<WN.AD> is required to<WN.V> accommodate<WN.V> a flux<WN.N> of<SW.P> 0.48<U> C-mol/min*L-cytosol<SI>, is<WN.V> 8<U> mM<SI>.<PU>

Step 3. Map the syntactic grammar to the semantic structure in the domain ontology

After pattern matching, we substitute the proper noun phrase patterns with special variable symbols ("x's") to simplify subsequent parsing. The substituted sentence is given to the Minipar parser, which yields the dependency tree shown in Figure 7. It extracts the subject and object at the syntactic level. The parsing results are then used to extract the relationships in the sentence as RDF triples <Subject, Predicate, Object>. We combine the syntactic and semantic relationships and map them to the semantic template defined in the domain ontology.
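The inline tags produced by Steps 1 and 2 are easy to process mechanically. The following is a minimal Python sketch, not the authors' implementation, of how a tagged sentence could be scanned for number and unit pairs, which are the values that later appear as objects of the triples. The function names `tagged_tokens` and `quantities` are hypothetical, and the `<WN,V>` typo in the Step 2 example is assumed to mean `<WN.V>`.

```python
import re

# Tagged sentence as produced by Step 2 (copied from the example above,
# with the tag "<WN,V>" assumed to be a typo for "<WN.V>").
TAGGED = (
    "The pyruvate<ME.D> concentration<ME.F> that<WN.AD> is required to<WN.V> "
    "accommodate<WN.V> a flux<WN.N> of<SW.P> 0.48<U> C-mol/min*L-cytosol<SI>, "
    "is<WN.V> 8<U> mM<SI>.<PU>"
)

# One token = a run of characters without spaces or '<', followed by a <TAG>.
TOKEN = re.compile(r"([^\s<]+)<([A-Z.]+)>")


def tagged_tokens(text):
    """Return (word, tag) pairs; words without a tag (e.g. 'The') are skipped."""
    return TOKEN.findall(text)


def quantities(tokens):
    """Pair each number (<U>) with the unit (<SI>) that directly follows it."""
    pairs = []
    for (word, tag), (next_word, next_tag) in zip(tokens, tokens[1:]):
        if tag == "U" and next_tag == "SI":
            pairs.append((float(word), next_word))
    return pairs


if __name__ == "__main__":
    for value, unit in quantities(tagged_tokens(TAGGED)):
        print(value, unit)
```

Deciding which quantity belongs to which subject (concentration or flux) still requires the dependency parse of Step 3; this sketch only recovers the values and their units.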
We can extract the triples (The pyruvate concentration, is, 8 mM) and (The pyruvate flux, is, 0.48 C-mol/min*L-cytosol).

(
(() U * )
(() fin C E2 )
1 (The ~ Det 3 det (gov concentration))
2 (pyruvate ~N 3 nn (gov concentration))
3 (concentration ~ N 5 s (gov require))
4 (is be be 5 be (gov require))
5 (required require V E0 i (gov fin))
6 (to ~ Aux 7 aux (gov accommodate))
7 (accommodate ~V 5 sc (gov require))
E3 (() concentration N 7 subj (gov accommodate) (antecedent 3))
8 (a ~ Det 9 det (gov flux))
9 (flux ~N 7 obj (gov accommodate))
10 (of ~ Prep 9 mod (gov flux))
11 (0.48 ~N 16 num (gov C-mol/min*L-cytosol))
12 (C ~U 16 lex-mod (gov C-mol/min*L-cytosol))
13 (~U 16 lex-mod (gov C-mol/min*L-cytosol))
14 (mol/min*L ~U 16 lex-mod (gov C-mol/min*L-cytosol))
15 (~U 16 lex-mod (gov C-mol/min*L-cytosol))
16 (cytosol C-mol/min*L-cytosol N 10 pcomp-n (gov of))
19 (8 ~U 20 lex-mod (gov "8 mM"))
20 (mM "8 mM" N E2 )
E2 E0

Figure 7 Minipar parsing results

Step 4. Convert the plain text into OWL format

In this step, the system assigns semantic categories from the domain ontology to the phrase terms in the triples extracted by the parser. Because automatic semantic annotation is time-consuming, the system converts each sentence into OWL instance format.

6.3 Modeling and Simulating Biological pathway - Glycolysis

In this project, we used the biochemical network simulation software Copasi to simulate the process of glycolysis shown in Figure 8. Because E. coli lacks the enzyme EC 4.1.1.1, we used data from yeast for the simulation in order to compute how the concentration of ethanol changes. Copasi is the newly improved successor of an older and commonly used biochemical simulator called Gepasi. Figure 10 shows the Copasi interface and the configuration options for the network. Under Reactions, all the reactions involved in our selected pathway are entered.
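To make concrete what entering reactions amounts to, the sketch below shows, under stated assumptions, how a list of reactions can be turned into differential equations and integrated over time. It is an illustration only and not how Copasi is implemented: the two-step chain P2G → PEP → PYR, the abbreviation PYR, the irreversible mass-action rate laws, and the rate constants 2.0 and 1.0 are all hypothetical, and the fixed-step Euler integrator stands in for the more sophisticated numerical methods a real simulator uses.

```python
# Each reaction: name -> (stoichiometry, rate function).
# Stoichiometry maps metabolite -> coefficient (negative = consumed).
# Rate laws and constants here are hypothetical placeholders.
REACTIONS = {
    "ENO": ({"P2G": -1, "PEP": +1}, lambda c: 2.0 * c["P2G"]),
    "PYK": ({"PEP": -1, "PYR": +1}, lambda c: 1.0 * c["PEP"]),
}


def derivatives(conc):
    """dx/dt of every metabolite: sum over reactions of coefficient * rate."""
    dxdt = {metabolite: 0.0 for metabolite in conc}
    for stoich, rate in REACTIONS.values():
        v = rate(conc)
        for metabolite, coeff in stoich.items():
            dxdt[metabolite] += coeff * v
    return dxdt


def simulate(conc, t_end, dt=0.001):
    """Integrate the system from t = 0 to t_end with fixed-step forward Euler."""
    conc = dict(conc)  # do not modify the caller's initial concentrations
    for _ in range(int(round(t_end / dt))):
        d = derivatives(conc)
        for metabolite in conc:
            conc[metabolite] += dt * d[metabolite]
    return conc


final = simulate({"P2G": 1.0, "PEP": 0.0, "PYR": 0.0}, t_end=5.0)

if __name__ == "__main__":
    for metabolite, value in final.items():
        print(metabolite, round(value, 4))
```

Changing an initial concentration and re-running `simulate` is the scripted analogue of editing a value in the simulator and running a new time course.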
Figure 8 Glycolysis
Figure 9 Differential equation of glycolysis
Figure 10 Reaction Inputs Overview

By clicking the desired reaction on the left, we can examine the details of the specific reaction. Figure 11 shows the details of one of the reactions. The name of the enzyme that catalyzes the selected reaction is used as the reaction name. In this particular reaction, P2G is the only substrate and PEP is the only product. The parameters listed are V, Ka for the substrate, Kp for the product, and Keq. The values are predetermined so that the rate equation can be used. In the equation, a and p represent the concentrations of the substrate and product, respectively, Γ is the mass-action ratio p/a, and Ka and Kp are the Michaelis-Menten constants for a and p. The equation is used specifically for reversible reactions with only one substrate and one product. The rate equation is entered under "Functions" in the left-hand tree with the name function_4_vENO_2, and it is chosen as the kinetics of this reaction. Double-clicking "Functions" and then function_4_vENO_2 opens the view in Fig. 12, which contains the formula for the rate v and the data type of each parameter.

Figure 11 Reaction Input details of the reaction P2G ←→ PEP using enzyme ENO
Fig.12 Details of "function_4_vENO_2" for P2G ←→ PEP using enzyme ENO

For reversible Michaelis-Menten kinetics with two non-competing substrates and products, another equation is used to take both substrates and both products into account. In this case, a and b represent the two substrates, and p and q represent the two products. This kinetic equation is used for the enzymes HK, GAPDH, PFK and PYK.

Fig.13. Details of "function_4_vENO_2" for P2G ←→ PEP using enzyme ENO.

Only one compartment is needed in our simulation, so all of the metabolites are placed in uVol, as shown in Figure 13. Again, we double-click the desired metabolite to examine its details.
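The rate equations themselves appear only in the figures. As a hedged illustration, the Python sketch below writes out the usual reversible Michaelis-Menten forms that match the symbols defined earlier for the ENO reaction (V, Ka, Kp, Keq, and the mass-action ratio Γ) and for the two-substrate, two-product case. These forms are an assumption and should be checked against the equations in Teusink et al. (2000) before use; the function names and the parameter values in the usage lines are hypothetical.

```python
def reversible_uni_uni(a, p, V, Ka, Kp, Keq):
    """Reversible Michaelis-Menten rate, one substrate a and one product p.

    Assumed form: v = V * (a/Ka) * (1 - Gamma/Keq) / (1 + a/Ka + p/Kp),
    with Gamma = p/a. The numerator is expanded to a/Ka - p/(Ka*Keq) so
    that a = 0 does not cause a division by zero.
    """
    numerator = V * (a / Ka - p / (Ka * Keq))
    denominator = 1.0 + a / Ka + p / Kp
    return numerator / denominator


def reversible_bi_bi(a, b, p, q, V, Ka, Kb, Kp, Kq, Keq):
    """Reversible rate with two non-competing substrate-product couples.

    Assumed form: v = V * (a*b/(Ka*Kb)) * (1 - Gamma/Keq)
                      / ((1 + a/Ka + p/Kp) * (1 + b/Kb + q/Kq)),
    with Gamma = (p*q)/(a*b), expanded in the same way as above.
    """
    numerator = V * (a * b - p * q / Keq) / (Ka * Kb)
    denominator = (1.0 + a / Ka + p / Kp) * (1.0 + b / Kb + q / Kq)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the functions.
    print(reversible_uni_uni(a=2.0, p=0.0, V=10.0, Ka=2.0, Kp=1.0, Keq=5.0))
    print(reversible_bi_bi(a=2.0, b=3.0, p=0.0, q=0.0, V=8.0,
                           Ka=2.0, Kb=3.0, Kp=1.0, Kq=1.0, Keq=4.0))
```

Both functions return zero when the mass-action ratio equals Keq and change sign beyond it, which is the behavior expected of a reversible rate law.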
Initial concentrations can be set directly in the right-hand window in Figure 13 by selecting the cell and typing the desired value. We will create plots and reports of the concentration changes over time. By changing the initial concentration of any metabolite, the software can then predict the ETOH yield. In this way we can understand how the concentration of ETOH is affected by increasing the concentrations of different metabolites.

Now that all the inputs are set, we can define reports and plots. In Figure 14, we add the items to be reported by clicking the plus sign and change their order by selecting an item and clicking the up and down arrows. We can also set up plots as output by adding a default plot; the concentrations of the metabolites are the default output. In Figure 15, unnecessary metabolites can be removed by clicking "delete curve".

Fig.14. Report Definition
Fig.15. Plot configuration

Once the output settings are done, the last step is to obtain the result. First we click ReportDefinition to set the directory where the result is saved, and the window in Figure 17 appears. Once the target is chosen, we click Confirm and then "Run".

Fig.16. Time Course Result
Fig.17. Report Definition Selection

7. Results and Conclusion

The plot and the report produced by the simulation are shown in Figures 18 and 19, respectively. Although they are different forms of output, they contain essentially the same information. In the concentration plot, lines we do not want to display can be deselected by clicking the metabolite name at the bottom. The purple line represents the concentration of glucose transported inside, whereas the red and green lines show the concentrations of ETOH and CO2, respectively.

Fig.18. Result: Concentration Plot
Fig.19. Result: Report saved in a chosen file

To show how [ethanol] changes with [AcAld], we increase the initial concentration of AcAld by 20M. In Figure 20, the concentration of ethanol is predicted to increase by roughly 10M; note the sigmoidal shape of the red ETOH line at the beginning, caused by the drastic change in the concentration of AcAld. Thus, using Copasi, we can estimate how the concentration of ETOH changes with other concentrations over time.

Fig.20. The concentration of the ethanol production (red)

We also consider ethanol biosynthesis in E. coli. Ethanol production by microorganisms has a long history and many uses in the field, for example with cellulose as the raw material and ethanol as the product. Efficiency is what scientists care about most. Based on the references, we suppose that biosynthesis of ethanol using E. coli is a good way to proceed. It lets us omit some processing steps and is more efficient than the traditional production routes, either synthesis from ethylene or fermentation of grain, cellulose, or sugar. The key to ethanol production by E. coli is pyruvate, an important intermediate in the metabolic pathway shown in Figure 21. Compared with the common metabolic pathway, E. coli lacks two enzymes needed to convert pyruvate to ethanol. In the metabolism from pyruvate to ethanol, acetaldehyde is the intermediate.

Figure 21 Pyruvate metabolism

Figure 22 shows that the enzymes E. coli lacks are pyruvate decarboxylase and alcohol dehydrogenase. According to the metabolic pathway and gene databases EcoCyc (Encyclopedia of Escherichia coli K12 Genes and Metabolism) and NCBI, E. coli strain K12 does have an alcohol dehydrogenase (adhP) for step 2. Apart from strain K12, E. coli strain KO11 is an ethanol-producing recombinant in which the genes for ethanol production (pdc, adhB) were cloned from Z. mobilis.
In addition, pyruvate decarboxylase from Saccharomyces cerevisiae (pdc1) has been successfully isolated, fused to the E. coli indicator gene LacZ and the T7 RNA polymerase promoter phi 10, and expressed in E. coli.

Figure 22 Ethanol production from pyruvate

Below are two ways to carry out ethanol biosynthesis in E. coli:

1) Escherichia coli strain KO11
An ethanol-producing recombinant carrying the Z. mobilis genes for pyruvate decarboxylase (pdc) and alcohol dehydrogenase (adhB).

2) Escherichia coli strain K12 with pdc1 cloning
This strain has alcohol dehydrogenase (adhP) but lacks pyruvate decarboxylase, so we can clone Saccharomyces cerevisiae pyruvate decarboxylase (pdc1), possibly together with the T7 RNA polymerase promoter phi 10.

From glycolysis to pyruvate, and then from pyruvate to ethanol, we consider the individual steps of metabolism that directly or indirectly affect the yield, and we control substrates or products by inhibiting one or more reactions to maximize ethanol production by E. coli.

8. Future work

Future work includes combining flux-based static modeling with dynamic modeling based on kinetic equations. The model can start as a stoichiometric model that is gradually converted into a dynamic model by adding dynamic equations. Flux distribution analysis can be used to calculate each flux in the stoichiometric model. Substances at the boundary between the dynamic and stoichiometric models are influenced by fluxes from both.

Biological interactions in the life sciences are increasingly complex, and new kinds of cancers and viruses may emerge quickly. Biologists alone cannot tackle these problems, so computational biology is needed to deal with the large amounts of data. With standards and a common platform, life science researchers around the world can share results and cooperate, which would speed up drug discovery and drug design. Further expansion into clinical diagnostics and therapeutics (disease classification, prediction, treatment) and personal health care may become more popular in the future.

Reference

Teusink, B.; Passarge, J.; Reijenga, C.A.; Esgalhado, E.; et al. Can yeast glycolysis be understood in terms of in vitro kinetics of the constituent enzymes? Testing biochemistry. Eur. J. Biochem. 2000, 267, 5313-5329.
Karp, P.D.; Riley, M.; Saier, M.; Paulsen, I.T.; Collado-Vides, J.; Paley, S.; Pellegrini-Toole, A.; Bonavides, C.; Gama-Castro, S. The EcoCyc database. Nucleic Acids Res. 2002, 30, 56-58.
Yomano, L.P.; York, S.W.; Ingram, L.O. Isolation and characterization of ethanol-tolerant mutants of Escherichia coli KO11 for fuel ethanol production. Journal of Industrial Microbiology & Biotechnology 1998, 20, 132-138.
Genome research environment - GenRE, http://mips.gsf.de/genre/proj/genre