The BioCyc Ontologies

Markus Krummenacker
Bioinformatics Research Group, SRI International
kr@ai.sri.com
BioCyc.org, EcoCyc.org, MetaCyc.org, HumanCyc.org

SRI International Bioinformatics


Overview

- Pathway/Genome Databases (PGDBs)
  - BioCyc collection
  - EcoCyc, MetaCyc
- Pathway Tools Software & Applications
  - Visualization, Editing, Analysis, Omics data
  - Inference tools: PathoLogic, Operon predictor, Pathway hole filler
  - Tools for debugging a predicted metabolic network
- Some Ontology Details
  - Pathways, Reactions and Compounds, Enzymes, Genes
  - Regulation
- Integration with other efforts: BioPAX, GO, NCBI Taxonomy


Model Organism Databases / PGDBs

- DBs that describe the genome and molecular machinery of one specific organism
  - Integrating many diverse types of data into a coherent model of a cell
- Every sequenced organism with an active experimental community requires a model organism database (MOD)
  - Integrate genome data with information about the biochemical and genetic network of the organism
  - Integrate literature-based information with computational predictions
  - Ongoing updating of sequence, gene positions and functions, regulatory sites, pathways
- MODs are platforms for global analyses of the organism
  - Interpret omics data in a pathway context
  - In silico prediction of essential genes
  - Characterize systems properties of metabolic and genetic networks


BioCyc Collection of Pathway/Genome Databases

- Pathway/Genome Database (PGDB): combines information about
  - Pathways, reactions, substrates
  - Enzymes, transporters
  - Genes, replicons
  - Transcription factors/sites, promoters, operons
- Tier 1: Literature-derived PGDBs
  - MetaCyc
  - EcoCyc (Escherichia coli K-12)
- Tier 2: Computationally derived DBs, some curation (20 PGDBs)
  - HumanCyc
  - Mycobacterium tuberculosis
- Tier 3: Computationally derived DBs, no curation (349 DBs)


Pathway Tools: PathoLogic Inference

[Diagram: an Annotated Genome and the MetaCyc Reference Pathway DB are the inputs to PathoLogic, which produces a Pathway/Genome Database; the database is served by the Pathway/Genome Editors and the Pathway/Genome Navigator.]


Pathway Tools Software: PGDBs Created Outside SRI

- 1,300+ licensees: 75+ groups applying software to 200+ organisms
- Saccharomyces cerevisiae, SGD project, Stanford University
- Mouse, MGD, Jackson Laboratory
- dictyBase, Northwestern University
- Under development:
  - CGD (Candida albicans), Stanford University
  - Drosophila, P. Ebert in collaboration with FlyBase
  - C. elegans, P. Ebert in collaboration with WormBase
- Planned:
  - RGD (Rat), Medical College of Wisconsin
- Arabidopsis thaliana, TAIR, Carnegie Institution of Washington
- PlantCyc, ~20 plant PGDBs, Carnegie Institution of Washington
- Six Solanaceae species, Cornell University
- GrameneDB, Cold Spring Harbor Laboratory
- Medicago truncatula, Samuel Roberts Noble Foundation


Pathway Tools Software: PGDBs Created Outside SRI (continued)

- BioHealthBase (M. tuberculosis, F. tularensis), PATRIC, ApiDB
- Gary Xie, Los Alamos Lab: dental pathogens
- F. Brinkman, Simon Fraser Univ: Pseudomonas aeruginosa
- V. Schachter, Genoscope: Acinetobacter
- M. Bibb, John Innes Centre: Streptomyces coelicolor
- G. Church, Harvard: Prochlorococcus marinus, multiple strains
- E. Uberbacher, ORNL and G. Serres, MBL: Shewanella oneidensis
- R.J.S. Baerends, University of Groningen: Lactococcus lactis IL1403, Lactococcus lactis MG1363, Streptococcus pneumoniae TIGR4, Bacillus subtilis 168, Bacillus cereus ATCC14579
- Matthew Berriman, Sanger Centre: Trypanosoma brucei, Leishmania major
- Herbert Chiang, Washington University: Bacteroides thetaiotaomicron
- Sergio Encarnacion, UNAM: Sinorhizobium meliloti
- Gregory Fournier, MIT: Mesoplasma florum
- Mark van der Giezen, University of London: Entamoeba histolytica, Giardia intestinalis
- Michael Gottfert, Technische Universität Dresden: Bradyrhizobium japonicum
- Artiva Maria Goudel, Universidade Federal de Santa Catarina, Brazil: Chromobacterium violaceum ATCC 12472


Pathway Tools Software: PGDBs Created Outside SRI (continued)

- Large scale users:
  - C. Medigue, Genoscope: 150+ PGDBs
  - G. Burger, U Montreal: 60+ PGDBs
- Bart Weimer, Utah State University: Lactococcus lactis, Brevibacterium linens, Lactobacillus acidophilus, Lactobacillus plantarum, Lactobacillus johnsonii, Listeria monocytogenes
- Partial listing of outside PGDBs at BioCyc.org


Pathway Evidence

[Figure not recovered]


Pathway Tools Overviews and Omics Viewers

- Provide genome-scale visualizations of cellular networks
  - Harness human visual system to interpret patterns in biological contexts
  - Designed to avoid the hairball effect
  - Generated automatically from PGDB
  - Magnify, interrogate
- Omics viewers paint omics data onto overview diagrams
  - Different perspectives on same dataset
  - Use animation for multiple time points or conditions
  - Paint any data that associates numbers with genes, proteins, reactions, or metabolites


Regulatory Overview and Omics Viewer

- Show regulatory relationships among gene groups


Comparative Analysis

- Via Cellular Overview
- Comparative genome browser
- Comparative pathway table
- Comparative analysis reports
  - Compare reaction complements
  - Compare pathway complements
  - Compare transporter complements


Pathway Tools Ontology

- 1621 classes
  - Main classes such as: Pathways, Reactions, Compounds, Macromolecules, Proteins, Replicons, DNA-Segments (Genes, Operons, Promoters)
  - Taxonomies for Pathways, Reactions (EC), Compounds
  - Cell Component Ontology
  - Protein Feature ontology
- 221 slots for attributes and relationships
  - Meta-data: Creator, Creation-Date
  - Comment, Citations, Common-Name, Synonyms
  - Attributes: Molecular-Weight, DNA-Footprint-Size
  - Relationships: Catalyzes, Component-Of, Product
  - Evidence codes, supporting citations


Pathway/Genome Database Schema

[Figure not recovered]


Protein Feature Ontology

[Figure not recovered]


Advanced Query Form

- Intuitive construction of complex database queries with the expressive power of SQL


Enzymatic-Reactions

[Diagram: frames (unindented) and the slots that link them (indented), from pathway down to genes]

TCA Cycle
    in-pathway
Succinate + FAD = fumarate + FADH2
    reaction
Enzymatic-reaction
    catalyzes
Succinate dehydrogenase
    component-of
Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2
    product
sdhA, sdhB, sdhC, sdhD


Need for Enzymatic-Reactions

- Reactions can have isozymes
- Enzymes can be multi-functional
- Enzymatic-Reaction frames are needed to decouple the many-to-many relationships between reactions and enzymes
- Isozymes may have different inhibitors, etc.
- Gene-Reaction schema diagrams: [figure not recovered]


New Representation of Regulation

- Previously, regulation was represented idiosyncratically:
  - One representation for modulation of enzymes
  - Completely different representation for regulation of transcription initiation
- Now unified under a single Regulation class with subclasses
- This enables us to easily add support for new kinds of regulation, e.g.
  - Transcriptional attenuation (done)
  - Regulation of translation by small RNAs (in progress)
- New tools for display and editing of new Regulation classes


Operons and Transcription Units

- Operon: A set of two or more genes that are transcribed as a unit. May include multiple promoters.
- Transcription Unit: A set of one or more genes that are transcribed as a unit from a single promoter.
- The Pathway Tools schema does not represent operons explicitly, only transcription units


Ontology for Transcriptional Regulation

[Diagram: frames for the trp operon example and the slots linking them.
 Frames: trp, apoTrpR, TrpR*trp, BR001, reg001, site001, trpLEDCBAp1, trpLEDCBA, trpL, trpE, trpD, trpC, trpB, trpA.
 Slot labels: left, right, components, regulator, regulated-by, associated-binding-site.]


Representation of Transcriptional Regulation

- Transcription-Unit
  - Components include genes, a single promoter, zero or more terminators
- Binding-Sites
  - Linked to regulation frames
- Regulation frames
  - Transcriptional Initiation: defines a 3-way pairing between promoter, transcription factor and binding-site
  - Transcriptional Attenuation: defines the relationship between a terminator and the entity (tRNA, protein, small molecule) that regulates it


Infer Anti-Microbial Drug Targets

- Infer drug targets as genes coding for enzymes that catalyze chokepoint reactions
- Two types of chokepoint reactions: [figure not recovered]
- Genome Research 14:917 2004


Reachability Analysis of Metabolic Network

- Given:
  - A PGDB for an organism
  - A set of initial metabolites
- Infer:
  - What set of products can be synthesized by the small-molecule metabolism of the organism
- Can a known growth medium yield known essential compounds?
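A minimal sketch of how such a reachable set could be computed, assuming each reaction is reduced to a one-directional pair of reactant and product sets. The function name, the data layout, and the toy network are hypothetical illustrations, not the Pathway Tools implementation: starting from the initial metabolites, every reaction whose reactants are all available is applied repeatedly until nothing new is produced.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical representation: a reaction is (reactants, products),
# treated as one-directional for simplicity.
Reaction = Tuple[FrozenSet[str], FrozenSet[str]]


def reachable_metabolites(
    reactions: Iterable[Reaction], initial: Iterable[str]
) -> Set[str]:
    """Return every metabolite producible from `initial` via `reactions`."""
    reaction_list: List[Reaction] = list(reactions)
    available: Set[str] = set(initial)
    changed = True
    while changed:
        changed = False
        for reactants, products in reaction_list:
            # A reaction can proceed only if all of its reactants are available.
            if reactants <= available and not products <= available:
                available |= products
                changed = True
    return available


# Toy network (made-up subset, for illustration only).
TOY_REACTIONS: List[Reaction] = [
    (frozenset({"glucose", "ATP"}), frozenset({"glucose-6-phosphate", "ADP"})),
    (frozenset({"glucose-6-phosphate"}), frozenset({"fructose-6-phosphate"})),
    (frozenset({"X"}), frozenset({"Y"})),  # never applied: X is not available
]

reached = reachable_metabolites(TOY_REACTIONS, {"glucose", "ATP"})
essential = {"fructose-6-phosphate", "Y"}
missing = essential - reached  # essential compounds the medium cannot yield
```

Comparing `reached` against a set of essential compounds, as in the last two lines, gives the compounds that the chosen initial metabolites cannot produce.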
Romero and Karp, Pacific Symposium on Biocomputing, 2001


Algorithm: Forward Propagation Through Production System

- Each reaction becomes a production rule
- Each metabolite in nutrient set becomes an axiom

[Diagram labels: Nutrient set, Metabolite set, PGDB reaction pool, “Fire” reactions, Reactants, Products]


Results

- Phase I: Forward propagation
  - 21 initial compounds yielded only half of the 41 essential compounds for E. coli
- Phase II: Manually identify
  - Bugs in EcoCyc (e.g., two objects for tryptophan)
  - “Bootstrap compounds”
  - Missing initial protein substrates (e.g., ACP)
  - Incomplete knowledge of E. coli metabolic network
  - Protein synthesis not represented
- Phase III: Forward propagation with 11 more initial metabolites
  - Yielded all 41 essential compounds

[Figure residue: small reaction diagram accompanying “Bootstrap compounds”, with labels A, B, B’, C, D]


Integration with other efforts

- Export of
  - BioPAX
  - SBML
- Import of
  - Enzyme DB (EC hierarchy of reactions)
  - GO
  - NCBI Taxonomy
  - BioPAX (work in progress)


Near Future

- Signalling pathways
  - Validating the design
- Regulation
  - Small RNAs, and other additional types
- Higher Eukaryotes
  - Gene expression, multiple splice forms
  - Cell types, localization


Summary

- Pathway/Genome Databases
  - MetaCyc: non-redundant DB of literature-derived pathways
  - 370 organism-specific PGDBs available through SRI at BioCyc.org
  - Computational theories of biochemical machinery
- Pathway Tools software
  - Extract pathways from genomes
  - Morph annotated genome into structured ontology
  - Distributed curation tools for MODs
  - Query, visualization, WWW publishing


BioCyc and Pathway Tools Availability

- BioCyc.org Web site and database files freely available to all
- Pathway Tools freely available to non-profits
  - Macintosh, PC/Windows, PC/Linux
- References
  - Pathway Tools User’s Guide, Appendix A: Guide to the Pathway Tools Schema
  - Ontology Papers section of http://biocyc.org/publications.shtml


Acknowledgements

- SRI: Suzanne Paley, Ron Caspi, Ingrid Keseler, Carol Fulcher, Markus Krummenacker, Alex Shearer, Tomer Altman, Joe Dale, Fred Gilham, Pallavi Kaipa
- Funding sources:
  - NIH National Center for Research Resources
  - NIH National Institute of General Medical Sciences
  - NIH National Human Genome Research Institute
- EcoCyc Collaborators: Julio Collado-Vides, Robert Gunsalus, Ian Paulsen
- MetaCyc Collaborators: Sue Rhee, Peifen Zhang, Kate Dreher, Lukas Mueller, Anuradha Pujar
- BioCyc.org
- Learn more from BioCyc webinars: biocyc.org/webinar.shtml


BioWarehouse: A Bioinformatics Database Warehouse

Peter D. Karp, Tom J. Lee, Valerie Wagner
BMC Bioinformatics 7:170 2006
bioinformatics.ai.sri.com/biowarehouse/

[Diagram: source databases surrounding BioWarehouse, which runs on Oracle (10g) or MySQL (4.1.11). Sources shown: BioCyc, BioPAX, ENZYME, CMR, Genbank, GO, Eco2DBase, KEGG, UniProt, Taxonomy, MAGE-ML.]


Motivations

- Hundreds of bioinformatics DBs exist
- Important problems involve queries across multiple DBs


Why is the Multidatabase Approach Alone Not Sufficient?

- Multidatabase query approaches assume databases are in a queryable DBMS
- Most sites that do operate DBMSs do not allow remote query access because of security and loading concerns
- Users want to control data stability
- Users want to control speed of their hardware
- Internet bandwidth limits query throughput
- Users need to capture, integrate and publish locally produced data of different types
- Multidatabase and Warehouse approaches are complementary


Key Challenges for BioWarehouse

- Designing a schema that accurately captures the contents of source DBs
- Designing a schema that is understandable and scalable
- Addressing poorly-specified syntax & semantics of source DBs
- Balancing the preservation of source data with mapping into common semantics


Technical Approach

- Multi-platform support: Oracle (10g) and MySQL
- Schema support for multitude of bioinformatics datatypes
- Create loaders for public bioinformatics DBs
  - Parse file format of the source DB
  - Semantic transformations
  - Insert DB contents into warehouse tables
- Provide Warehouse query access mechanisms
  - SQL queries via ODBC, JDBC, OAA
- Operate public BioWarehouse server: publichouse
- BMC Bioinformatics 7:170 2006


PublicHouse Server

- Publicly queryable BioWarehouse server operated by SRI
- Manages a set of biological DBs constructed using BioWarehouse
  - CMR
  - Open BioCyc DBs
  - ENZYME
  - NCBI Taxonomy
  - UniProt
- Large-scale data mining using
  - Dashboard Warehouse Query Analyzer
  - MySQL client command line
- See: http://bioinformatics.ai.sri.com/biowarehouse/publichouse.html
  - Host: publichouse.sri.com
  - Port: 3306
  - Database: biospice


BioWarehouse Schema

- Manages many bioinformatics datatypes simultaneously
  - Pathways, Reactions, Chemicals
  - Proteins, Genes, Replicons
  - Sequences, Sequence Features
  - Organisms, Taxonomic relationships
  - Computations (sequence matches)
  - Citations, Controlled vocabularies
  - Links to external databases
  - Gene expression datasets
  - Protein-protein interaction datasets
  - Flow cytometry datasets
- Each type of warehouse object implemented through one or more relational tables (currently ~150)


Warehouse Schema

- Manages multiple datasets simultaneously
  - Dataset = single version of a database
- Supports
  - Version comparison
  - Multiple software tools or experiments that require access to different versions
- Each dataset is a warehouse entity
- Every warehouse object is registered in a dataset


Warehouse Schema (continued)

- Different databases storing the same biological datatypes are coerced into the same warehouse tables
- Design of most datatypes inspired by multiple databases
- Representational tricks to decrease schema bloat
  - Single space of primary keys
  - Single set of satellite tables such as for synonyms, citations, comments, etc.
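
The warehouse schema ideas described above (datasets as warehouse entities, every object registered in a dataset, a single space of primary keys, and one shared set of satellite tables) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the table and column names are hypothetical and are not the actual BioWarehouse schema, and an in-memory SQLite database stands in for the Oracle or MySQL back ends.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
SCHEMA = """
CREATE TABLE dataset (
    dataset_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    version    TEXT NOT NULL            -- one row per version of a source DB
);
CREATE TABLE warehouse_object (
    object_id   INTEGER PRIMARY KEY,    -- single key space for all object types
    dataset_id  INTEGER NOT NULL REFERENCES dataset(dataset_id),
    object_type TEXT NOT NULL
);
CREATE TABLE gene (
    object_id INTEGER PRIMARY KEY REFERENCES warehouse_object(object_id),
    name      TEXT NOT NULL
);
CREATE TABLE protein (
    object_id INTEGER PRIMARY KEY REFERENCES warehouse_object(object_id),
    name      TEXT NOT NULL
);
CREATE TABLE synonym (                   -- one satellite table shared by all types
    object_id INTEGER NOT NULL REFERENCES warehouse_object(object_id),
    synonym   TEXT NOT NULL
);
"""

OBJECT_TABLES = ("gene", "protein")


def add_object(conn, dataset_id, object_type, name, synonyms=()):
    """Register an object in a dataset and return its warehouse-wide key."""
    if object_type not in OBJECT_TABLES:
        raise ValueError(f"unknown object type: {object_type}")
    cur = conn.execute(
        "INSERT INTO warehouse_object (dataset_id, object_type) VALUES (?, ?)",
        (dataset_id, object_type),
    )
    object_id = cur.lastrowid
    # Safe to interpolate: object_type was checked against OBJECT_TABLES above.
    conn.execute(
        f"INSERT INTO {object_type} (object_id, name) VALUES (?, ?)",
        (object_id, name),
    )
    conn.executemany(
        "INSERT INTO synonym (object_id, synonym) VALUES (?, ?)",
        [(object_id, s) for s in synonyms],
    )
    return object_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Two versions of the same (made-up) source database coexist as two datasets.
conn.execute("INSERT INTO dataset VALUES (1, 'ExampleDB', '1.0')")
conn.execute("INSERT INTO dataset VALUES (2, 'ExampleDB', '2.0')")

gene_id = add_object(conn, 1, "gene", "sdhA", ["example gene synonym"])
protein_id = add_object(conn, 2, "protein", "Sdh-flavo", ["example protein synonym"])

# One query over the shared synonym table covers every object type.
rows = conn.execute(
    "SELECT o.object_type, o.dataset_id, s.synonym "
    "FROM synonym s JOIN warehouse_object o USING (object_id) "
    "ORDER BY o.object_id"
).fetchall()
```

Because genes and proteins draw their keys from the same `warehouse_object` table, a satellite table such as `synonym` needs only one foreign key column instead of one table per datatype, which is the kind of schema-bloat reduction the slide refers to.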