Computational Exploration of Metabolic Networks with Pathway Tools
Part 1: Overview & Representations

Suzanne Paley
Bioinformatics Research Group, SRI International
paley@ai.sri.com
http://BioCyc.org/

Motivation: Theories of Cellular Function Too Large for One Mind to Grasp
  - Example: E. coli metabolic network
      - 160 pathways involving 744 reactions and 791 substrates
  - Example: E. coli genetic network
      - Control by 97 transcription factors of 1174 genes in 630 transcription units
  - Past solutions:
      - Partition theories across multiple minds
      - Encode theories in natural-language text
  - We cannot compute with theories in those forms
      - Evaluate theories for consistency with new data: microarrays
      - Refine theories with respect to new data
      - Compare theories describing different organisms

Solution: Biological Knowledge Bases
  - Store biological knowledge and theories in computers in a declarative form
      - Amenable to computational analysis and generative user interfaces
  - Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
  - A high-quality, comprehensive knowledge base enables us to ask and answer important new questions

Terminology
  - Model Organism Database (MOD): DB describing genome and other information about an organism
  - Pathway/Genome Database (PGDB): MOD that combines information about
      - Pathways, reactions, substrates
      - Enzymes, transporters
      - Genes, replicons
      - Transcription factors, promoters, operons, DNA binding sites
  - BioCyc: Collection of 15 PGDBs at BioCyc.org
      - EcoCyc, AgroCyc, HumanCyc

Pathway Tools Software
  - PathoLogic
      - Prediction of metabolic network from genome
      - Computational creation of new Pathway/Genome Databases
  - Pathway/Genome Editors
      - Distributed curation of genome annotations
      - Distributed object database system
      - Interactive editing tools
  - Pathway/Genome Navigator
      - WWW publishing of PGDBs
      - Graphic depictions of pathways, chromosomes, operons
      - Analysis operations
          - Pathway visualization of gene-expression data
          - Global comparisons of metabolic networks

Pathway Tools Software
  (diagram) Components: Pathway/Genome Navigator, PathoLogic Pathway Predictor, Pathway/Genome Databases, Pathway/Genome Editors

Pathway/Genome Database
  (diagram) Contents: Pathways; Reactions; Compounds; Proteins; Genes; Operons, Promoters, DNA Binding Sites; Chromosomes, Plasmids; CELL

Pathway Tools Algorithms
  - Visualization and editing tools for the following datatypes
      - Full Metabolic Map
          - Paint gene expression data on metabolic network; compare metabolic networks
      - Pathways
          - Pathway prediction
      - Reactions
          - Balance checker
      - Compounds
          - Chemical substructure comparison
      - Enzymes, Transporters, Transcription Factors
      - Genes
      - Chromosomes
      - Operons
          - Operon prediction; visualize genetic network

Definitions
  - Chemical reactions interconvert chemical compounds
      - A + B = C + D
  - An enzyme is a protein that accelerates chemical reactions
  - A pathway is a linked set of reactions
      - Often regulated as a unit
      - A conceptual unit of the cell's biochemical machine

Operations of the Metabolic Overview
  - Find pathways, compounds
  - Find reactions
      - By enzyme name, EC number, substrates, modulation
      - All with isozymes
      - All occurring in multiple pathways
      - By EC class, pathway class
  - Find genes
      - By name, gene class
      - All regulated by transcriptional regulator protein

Metabolic Overview Queries
  - Species comparison
      - Highlight reactions that are
          - Shared/not-shared with
          - Any-one/All-of
          - A specified set of species
  - Overlay expression data
      - Colors reflect expression level and are user-configurable
      - Can show single experiment or animated time series

EcoCyc Project
  - E. coli Encyclopedia
      - Model-Organism Database for E. coli
      - Began in 1992 as collaboration between Karp and Riley
      - Over 3500 literature citations
  - Collaborative development via Internet
      - Karp (SRI): Bioinformatics architect
      - John Ingraham: Advisor
      - (SRI): Metabolic pathways
      - Saier (UCSD) and Paulsen (TIGR): Transport
      - Collado (UNAM): Regulation of gene expression
  - Ontology: 1000 biological classes
  - Database content: 17,700 instances

EcoCyc = E. coli Dataset + Pathway/Genome Navigator
  Pathways:             165
  Reactions:          2,760
  Enzymes:              914
  Transporters:         162
  Proteins:           4,273
  Promoters:            812
  TransFac Sites:       956
  Citations:          3,508
  Compounds:            774
  Genes:              4,393
  Transcription Units:  724
  Factors:              110
  http://BioCyc.org/

MetaCyc: Metabolic Encyclopedia
  - Nonredundant metabolic pathway database
  - Describe a representative sample of every experimentally determined metabolic pathway
  - Literature-based DB with extensive references and commentary
  - Pathways, reactions, enzymes, substrates
  - 460 pathways, 1267 enzymes, 4294 reactions
      - 172 E. coli pathways, 2735 citations
  - Nucleic Acids Research 30:59-61 2002.
  - Jointly developed by SRI and Carnegie Institution
  - New focus on plant pathways

MetaCyc Data
  - MetaCyc contains one DB object for each distinct pathway
      - Distinct in terms of reaction steps
      - Each pathway labeled with species it occurs in
  - MetaCyc pathways are experimentally determined
  - 4218 reactions in MetaCyc
      - 401 lack EC numbers

MetaCyc Enzyme Data
  - Reaction(s) catalyzed
  - Alternative substrates
  - Cofactors / prosthetic groups
  - Activators and inhibitors
  - Subunit structure
  - Molecular weight, pI
  - Comment, literature citations
  - Species

MetaCyc Frequent Organisms
  Escherichia coli            156
  Arabidopsis thaliana         47
  Homo sapiens                 30
  Pseudomonas                  21
  Bacillus subtilis            20
  Salmonella typhimurium       20
  Sulfolobus solfataricus      18
  Pseudomonas putida           14
  Saccharomyces cerevisiae     14
  Haemophilus influenzae       13
  Glycine max                  11
  Deinococcus radiodurans      10

EcoCyc and MetaCyc
  - Review-level databases
  - Data derived primarily from biomedical literature
      - Manual entry by staff curators
      - Updates by staff curators only
  - Data validation
      - Consistency constraints
      - Lisp programs that verify other semantic relationships
          - Unbalanced chemical reactions

Computationally-Derived PGDBs
  (diagram) PathoLogic Software integrates genome and pathway data to identify putative metabolic networks
  - Input: Annotated Genomic Sequence
      - Gene Products, Genes/ORFs, DNA Sequences
  - Input: Multi-organism Pathway Database (MetaCyc)
      - Pathways, Reactions, Compounds
  - Output: Pathway/Genome Database
      - Pathways, Reactions, Compounds, Gene Products, Genes, Genomic Map

PathoLogic Input/Output
  - Inputs:
      - File listing genetic elements
          - http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
      - Files containing DNA sequence for each genetic element
      - Files containing annotation for each genetic element
      - MetaCyc database
  - Output:
      - Pathway/genome database for the subject organism
      - Directory tree for the subject organism
      - Reports that summarize:
          - Evidence contained in the input genome for the presence of reference pathways
          - Reactions missing from inferred pathways

PathoLogic Functionality
  - Initialize schema for new PGDB
  - Transform existing genome to PGDB form
  - Infer metabolic pathways and store in PGDB
  - Infer operons and store in PGDB
  - Assist user with manual tasks
      - Assign enzymes to reactions they catalyze
      - Identify false-positive pathway predictions
      - Build protein complexes from monomers
  - Assemble Overview diagram

BioCyc Collection of Pathway/Genome DBs
  - Literature-based datasets:
      - Escherichia coli (EcoCyc)
      - MetaCyc
  - Computationally-derived datasets:
      - Agrobacterium tumefaciens
      - Caulobacter crescentus
      - Chlamydia trachomatis
      - Bacillus subtilis
      - Helicobacter pylori
      - Haemophilus influenzae
      - Homo sapiens
      - Mycobacterium tuberculosis H37Rv
      - Mycobacterium tuberculosis CDC1551
      - Mycoplasma pneumoniae
      - Pseudomonas aeruginosa
      - Treponema pallidum
      - Vibrio cholerae
  - PGDBs at other sites:
      - Arabidopsis thaliana (TAIR)
      - Methanococcus jannaschii (EBI)
      - Saccharomyces cerevisiae (SGD)
      - Synechocystis PCC6803
  - Legend on the original slide: Yellow = Open Database
  http://BioCyc.org/

HumanCyc: Human Metabolic Pathway Database
  - PGDB of human metabolic pathways built using PathoLogic
  - Contains information on 28,700 genes, their products, and the metabolic reactions and pathways they catalyze (no signalling pathways)
      - Chromosome and contigs from Ensembl
      - Human genetic loci from LocusLink
      - Mitochondrion data from GenBank
      - Ensembl and LocusLink gene entries were merged to eliminate redundancies where possible.
  - Contains links to human genome web sites
  - Plan to hire one curator to refine and curate with respect to literature over a 2-year period
      - Remove false-positive predictions
      - Insert known pathways missed by PathoLogic
      - Add comments and citations from pathways and enzymes to the literature
      - Add enzyme activators, inhibitors, cofactors, tissue information
  - Funded by commercial consortium

BioCyc and Pathway Tools Availability
  - WWW BioCyc freely available to all
      - BioCyc.org
  - Six BioCyc DBs openly available to all
  - BioCyc DBs freely available to non-profits
      - Flatfiles downloadable from BioCyc.org
      - Binary executable:
          - Sun UltraSparc-170 w/ 64MB memory
          - PC, 400MHz CPU, 64MB memory, Windows-98 or newer
      - PerlCyc API
  - Pathway Tools freely available to non-profits

Information Sources
  - Pathway Tools User's Guide
      - aic-export/ecocyc/genopath/released/doc/userguide1.pdf
          - Pathway/Genome Navigator
          - Appendix A: Guide to the Pathway Tools Schema
      - aic-export/ecocyc/genopath/released/doc/userguide2.pdf
          - PathoLogic, Editing Tools
  - Pathway Tools Web Site
      - http://bioinformatics.ai.sri.com/ptools/
      - Publications, programming examples, etc.
  - Pathway Tools Tutorial
      - http://bioinformatics.ai.sri.com/ptools/tutorial/

Pathway Tools Implementation Details
  - Allegro Common Lisp
  - Sun and PC platforms
  - Ocelot object database
  - 250,000 lines of code
  - Lisp-based WWW server at BioCyc.org
      - Manages 15 PGDBs

Frame Data Model
  - Frame Data Model: organizational structure for a PGDB
  - Knowledge base (KB, Database, DB)
  - Frames
  - Slots

Knowledge Base
  - Collection of frames and their associated slots, values, facets, and annotations
  - AKA: Database, PGDB
  - Can be stored within
      - An Oracle DB
      - A disk file
      - A Pathway Tools binary program

Frames
  - Entities with which facts are associated
  - Kinds of frames:
      - Classes: Genes, Pathways, Biosynthetic Pathways
      - Instances (objects): trpA, TCA cycle
  - Classes:
      - Superclass(es)
      - Subclass(es)
      - Instance(s)
  - A symbolic frame name (id, key) uniquely identifies each frame

Slots
  - Encode attributes/properties of a frame
      - Integer, real number, string
  - Represent relationships between frames
      - The value of a slot is the identifier of another frame
  - Every slot is described by a "slot frame" in a KB that defines meta information about that slot

Properties of Slots
  - Number of values
      - Single valued
      - Multivalued: sets, bags
  - Slot values
      - Any LISP object: integer, real, string, symbol (frame name)
  - Slotunits define properties of slots: datatypes, classes, constraints
  - Two slots are inverses if they encode opposite relationships
      - Slot Product in class Genes
      - Slot Gene in class Polypeptides

Pathway Tools Ontology
  - 1064 classes
      - Main classes such as: Pathways, Reactions, Compounds, Macromolecules, Proteins, Replicons, DNA-Segments (Genes, Operons, Promoters)
      - Taxonomies for Pathways, Reactions, Compounds
  - 205 slots
      - Meta-data: Creator, Creation-Date, Comment, Citations, Common-Name, Synonyms
      - Attributes: Molecular-Weight, DNA-Footprint-Size
      - Relationships: Catalyzes, Component-Of, Product
  - Classes, instances, slots all stored side by side in DBMS, share a single namespace

Slot Links from Gene to Pathway Frame
  (diagram) Frames and the slots that link them:
  - Pathway: TCA Cycle
  - Reaction: succinate + FAD = fumarate + FADH2
      - left: succinate, FAD
      - right: fumarate, FADH2
      - in-pathway: TCA Cycle
  - Enzymatic-reaction, linked to the reaction by the reaction slot
  - Enzyme: Succinate dehydrogenase, linked to the enzymatic-reaction by the catalyzes slot
  - Monomers: Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2, linked to the enzyme by the component-of slot
  - Genes: sdhA, sdhB, sdhC, sdhD (on Chrom), linked to the monomers by the product slot

Enzymatic-reaction frame stores properties of pairing between enzyme and reaction
  (diagram) Same frames as the previous slide, annotated with example properties:
  - Reaction (Succinate + FAD = fumarate + FADH2, in TCA Cycle): EC#, Keq
  - Enzymatic-reaction: Cofactors, Inhibitors
  - Succinate dehydrogenase and its monomers (Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2): Molecular wt, pI
  - Genes (sdhA, sdhB, sdhC, sdhD): Left-end-position

Monofunctional Monomer
  (diagram) Pathway, Reaction, Enzymatic-reaction, Monomer, Gene

Bifunctional Monomer
  (diagram) Pathway, two Reactions, two Enzymatic-reactions, one Monomer, one Gene

Monofunctional Multimer
  (diagram) Pathway, Reaction, Enzymatic-reaction, Multimer, four Monomers, four Genes

Pathway and Substrates
  (diagram) A Pathway linked by in-pathway to its Reactions; each Reaction has
  - left: Reactant-1, Reactant-2
  - right: Product-1, Product-2

Genetic Network Representation
  - Describe biological entities involved in control of transcription initiation
      - Promoters, operators, transcription factors, operons, terminators
  - Describe molecular interactions among these entities
      - Modulation of transcription factor activity
      - Binding of transcription factors to DNA binding sites
      - Effects on transcription initiation

Ontology for Transcriptional Regulation
  - One DB object defined for each biological entity and for each molecular interaction
  (diagram) Objects shown: trp, apoTrpR, Complexation reaction, TrpR*trp, site001, Int001, pro001, Int002, RpoSig70, transcription unit trpLEDCBA, genes trpL, trpE, trpD, trpC, trpB, trpA
  - Int001 (binding of TrpR*trp to site001) inhibits Int002 (binding of RNA Polymerase to promoter) and consequently prevents transcription of the genes in the transcription unit.

Principal Classes
  - Class names are capitalized, plural
  - Genetic-Elements, with subclasses:
      - Chromosomes
      - Plasmids
  - Genes
  - Transcription-Units
  - RNAs
  - Proteins, with subclasses:
      - Polypeptides
      - Protein-Complexes
  - Reactions, with subclasses:
      - Transport-Reactions
  - Enzymatic-Reactions
  - Pathways
  - Compounds-And-Elements

Slots in Multiple Classes
  - Common-Name
  - Synonyms
  - Names (computed as union of Common-Name, Synonyms)
  - Comment
  - Citations
  - DB-Links

Genes Slots
  - Chromosome
  - Left-End-Position
  - Right-End-Position
  - Centisome-Position
  - Transcription-Direction
  - Product

Proteins Slots
  - Molecular-Weight-Seq
  - Molecular-Weight-Exp
  - pI
  - Locations
  - Modified-Form
  - Unmodified-Form
  - Component-Of

Polypeptides Slots
  - Gene

Protein-Complexes Slots
  - Components

Reactions Slots
  - EC-Number
  - Left, Right
  - Substrates (computed as union of Left, Right)
  - Enzymatic-Reaction
  - DeltaG0
  - Spontaneous?

Enzymatic-Reactions Slots
  - Enzyme
  - Reaction
  - Activators
  - Inhibitors
  - Physiologically-Relevant
  - Cofactors
  - Prosthetic-Groups
  - Alternative-Substrates
  - Alternative-Cofactors
  - Reaction-direction

Pathways Slots
  - Reaction-List
  - Predecessors
  - Primaries
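
Illustrative sketch: frames, slots, and inverse slots

The frame data model described in these slides (frames identified by a symbolic name, multivalued slots whose values may name other frames, and inverse slot pairs such as Product in Genes and Gene in Polypeptides) can be mimicked in a few lines of Python. This is only a toy sketch, not the Ocelot or Pathway Tools API. The `KB` class, its methods, and the frame ids `ENZRXN-SDH` and `RXN-SDH` are hypothetical names invented for illustration. The slot names come from the slides, but apart from Product/Gene the inverse pairings below are assumptions made for the example.

```python
from collections import defaultdict


class KB:
    """Toy knowledge base: frame id -> {slot name -> list of values}."""

    def __init__(self):
        self.frames = {}
        self.inverses = {}  # slot name -> name of its inverse slot

    def declare_inverse(self, slot, inverse):
        # Record the pairing in both directions.
        self.inverses[slot] = inverse
        self.inverses[inverse] = slot

    def add_frame(self, frame_id):
        if frame_id in self.frames:
            raise ValueError(f"duplicate frame id: {frame_id}")
        self.frames[frame_id] = defaultdict(list)

    def put(self, frame_id, slot, value):
        """Add a slot value; if the value names a frame, fill the inverse slot too."""
        values = self.frames[frame_id][slot]
        if value not in values:
            values.append(value)
        inverse = self.inverses.get(slot)
        if inverse is not None and value in self.frames:
            back = self.frames[value][inverse]
            if frame_id not in back:
                back.append(frame_id)

    def get(self, frame_id, slot):
        return list(self.frames[frame_id].get(slot, []))


def pathways_of_gene(kb, gene_id):
    """Follow the slot chain gene -> monomer -> complex -> enzymatic reaction
    -> reaction -> pathway, as in the succinate dehydrogenase slide."""
    chain = ["Product", "Component-Of", "Catalyzes", "Reaction", "In-Pathway"]
    frontier = [gene_id]
    for slot in chain:
        reached = []
        for frame_id in frontier:
            for value in kb.get(frame_id, slot):
                if value not in reached:
                    reached.append(value)
        frontier = reached
    return frontier


kb = KB()
# Only Product/Gene is stated as an inverse pair in the slides; the rest are assumed.
for slot, inverse in [("Product", "Gene"), ("Component-Of", "Components"),
                      ("Catalyzes", "Enzyme"), ("Reaction", "Enzymatic-Reaction"),
                      ("In-Pathway", "Reaction-List")]:
    kb.declare_inverse(slot, inverse)

# Hypothetical frame ids for one branch of the succinate dehydrogenase example.
for frame_id in ["sdhA", "Sdh-flavo", "Succinate-dehydrogenase",
                 "ENZRXN-SDH", "RXN-SDH", "TCA-Cycle"]:
    kb.add_frame(frame_id)

kb.put("sdhA", "Product", "Sdh-flavo")
kb.put("Sdh-flavo", "Component-Of", "Succinate-dehydrogenase")
kb.put("Succinate-dehydrogenase", "Catalyzes", "ENZRXN-SDH")
kb.put("ENZRXN-SDH", "Reaction", "RXN-SDH")
kb.put("RXN-SDH", "In-Pathway", "TCA-Cycle")

print(pathways_of_gene(kb, "sdhA"))   # ['TCA-Cycle']
print(kb.get("Sdh-flavo", "Gene"))    # ['sdhA'], filled in through the inverse slot
```

Keeping the enzymatic reaction as its own frame between enzyme and reaction, as the slides do, gives properties of the pairing (cofactors, inhibitors) a place to live without attaching them to either the enzyme or the reaction alone.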