The EcoCyc and MetaCyc Pathway/Genome Databases

Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
pkarp@ai.sri.com
http://www.ai.sri.com/pkarp/
http://EcoCyc.org/

SRI International Bioinformatics

Overview
• Motivations and terminology
• Pathway/genome databases
  - BioCyc collection
  - EcoCyc, MetaCyc
• Pathway Tools software
• Bioinformatics Database Warehouse project

What to do When Theories Become Larger than Minds can Grasp?
• Example: E. coli metabolic network
  - 160 pathways involving 744 reactions and 791 substrates
• Example: E. coli genetic network
  - Control by 97 transcription factors of 1174 genes in 630 transcription units
• Past solutions:
  - Partition theories across multiple minds
  - Encode theories in natural-language text
• We cannot compute with theories in those forms
  - Evaluate theories for consistency with new data: microarrays
  - Refine theories with respect to new data
  - Compare theories describing different organisms

Solution: Biological Knowledge Bases
• Store biological knowledge and theories in computers in a declarative form
  - Amenable to computational analysis and generative user interfaces
• Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
• Accepted to store data in computers, but not knowledge
• Such knowledge bases are an integral part of the scientific enterprise

Pathway Definition
• Chemical reactions interconvert chemical compounds
  - A + B → C + D
• An enzyme is a protein that accelerates chemical reactions
• A pathway is a linked set of reactions
  - Often regulated as a unit
  - A conceptual unit of cell’s biochemical machine

Terminology
• Model Organism Database (MOD) – DB describing genome and other information about an organism
• Pathway/Genome Database (PGDB) – MOD that combines information about
  - Pathways, reactions, substrates
  - Enzymes, transporters
  - Genes, replicons
  - Transcription factors, promoters,
    operons, DNA binding sites
• BioCyc – Collection of 15 PGDBs at BioCyc.org
  - EcoCyc, AgroCyc, YeastCyc

BioCyc Collection of Pathway/Genome DBs
http://BioCyc.org/
• Literature-based Datasets:
  - MetaCyc
  - Escherichia coli (EcoCyc)
• Computationally Derived Datasets:
  - Agrobacterium tumefaciens
  - Caulobacter crescentus
  - Chlamydia trachomatis
  - Bacillus subtilis
  - Helicobacter pylori
  - Haemophilus influenzae
  - Mycobacterium tuberculosis H37Rv
  - Mycobacterium tuberculosis CDC1551
  - Mycoplasma pneumoniae
  - Pseudomonas aeruginosa
  - Saccharomyces cerevisiae
  - Treponema pallidum
  - Vibrio cholerae
Legend: Yellow = Open Database

Terminology – Pathway Tools Software
• PathoLogic
  - Prediction of metabolic network from genome
  - Computational creation of new Pathway/Genome Databases
• Pathway/Genome Editors
  - Distributed curation of PGDBs
  - Distributed object database system, interactive editing tools
• Pathway/Genome Navigator
  - WWW publishing of PGDBs
  - Querying, visualization of pathways, chromosomes, operons
  - Analysis operations
    - Pathway visualization of gene-expression data
    - Global comparisons of metabolic networks
Bioinformatics 18:S225 2002

Pathway Tools Algorithms
Query, visualization and editing tools for these datatypes:
• Full Metabolic Map
  - Paint gene expression data on metabolic network; compare metabolic networks
• Pathways
  - Pathway prediction
• Reactions
  - Balance checker
• Compounds
  - Chemical substructure comparison
• Enzymes, Transporters, Transcription Factors
• Genes: Blast search
• Chromosomes
• Operons
  - Operon prediction

Model Organism Databases
• DBs that describe the genome and other information about an organism
• Every sequenced organism with an active experimental community requires a MOD
  - Integrate genome data with information about the biochemical and genetic network of the organism
• MODs are platforms for global analyses of an organism
  - Interpret gene expression data in a pathway
    context
  - Characterize systems properties of metabolic and genetic networks
  - Determine consistency of metabolic and transport networks
  - In silico prediction of essential genes

EcoCyc Project – EcoCyc.org
• E. coli Encyclopedia
  - Model-Organism Database for E. coli
  - Computational symbolic theory of E. coli
  - Electronic review article for E. coli – over 3500 literature citations
  - Tracks the evolving annotation of the E. coli genome
• Collaborative development via Internet
  - Karp (SRI) -- Bioinformatics architect
  - John Ingraham -- Advisor
  - (SRI) -- Metabolic pathways
  - Saier (UCSD) and Paulsen (TIGR) -- Transport
  - Collado (UNAM) -- Regulation of gene expression
• Database content: 18,000 objects

EcoCyc = E. coli Dataset + Pathway/Genome Navigator
http://EcoCyc.org/
• Pathways: 165
• Reactions: 2,760
• Compounds: 774
• Enzymes: 914
• Transporters: 162
• Proteins: 4,273
• Genes: 4,393
• Transcription Units: 724
• Promoters: 812
• TransFac Sites: 956
• Factors: 110
• Citations: 3,508

EcoCyc Procedures
• All DB updates by 5 staff curators
  - Information gathered from biomedical literature
  - Corrections solicited from E. coli researchers
• Review-level database
• Four releases per year
  - Available through WWW site, as data files, as downloadable application
• Quality assurance of data and software:
  - Evaluate database consistency constraints
  - Perform element balancing of reactions
  - Run other checking programs
  - Display every DB object

MetaCyc: Metabolic Encyclopedia
• Nonredundant metabolic pathway database
• Describe a representative sample of every experimentally determined metabolic pathway
• Literature-based DB with extensive references and commentary
• Pathways, reactions, enzymes, substrates
• 460 pathways, 1267 enzymes, 4294 reactions
  - 172 E. coli pathways, 2735 citations
• Nucleic Acids Research 30:59-61 2002.
• Jointly developed by SRI and Carnegie Institution
• New focus on plant pathways

Family of Pathway/Genome Databases
[Figure: family of Pathway/Genome Databases; surviving label: MetaCyc]

Pathway Tools Implementation Details
• Allegro Common Lisp
• Sun and PC platforms
• Ocelot object database
• 250,000 lines of code
• Lisp-based WWW server at BioCyc.org
  - Manages 15 PGDBs

Pathway Tools Architecture
[Figure: architecture diagram; components: WWW Server, Pathway/Genome Navigator, X-Windows Graphics, GFP API, Object Editor, Pathway Editor, Reaction Editor, Object DBMS, Oracle]

Ocelot Knowledge Server Architecture
• Frame data model
  - Classes, instances, inheritance
  - Frames have slots that define their properties, attributes, relationships
  - A slot has one or more values
  - Each value can be any Lisp datatype
• Slotunits define metadata about slots:
  - Domain, range, inverse
  - Collection type, number of values, value constraints
• Transaction logging facility
• Schema evolution

Ocelot Storage System Architecture
• Persistent storage via disk files, Oracle DBMS
  - Concurrent development: Oracle
  - Single-user development: disk files
  - Read-only delivery: bundle data into binary program
• Oracle storage
  - DBMS is submerged within Ocelot, invisible to users
  - Relational schema is domain independent, supports multiple KBs simultaneously
  - Frames transferred from DBMS to Ocelot
    - On demand
    - By background prefetcher
    - Memory cache
    - Persistent disk cache to speed performance via Internet

The Common Lisp Programming Environment
• Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21 2000)

EcoCyc WWW Server

Pathway/Genome DBs Created by External Users
• Plasmodium falciparum, Stanford University
  plasmocyc.stanford.edu
• Mycobacterium tuberculosis, Stanford University
  BioCyc.org
• Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
  Arabidopsis.org:1555
• Methanococcus jannaschii, EBI
  Maine.ebi.ac.uk:1555
• Other PGDBs in progress by 24 other users
• Software freely available
• Each PGDB owned by its creator

Global Consistency Checking of Biochemical Network
• Given:
  - A PGDB for an organism
  - A set of initial metabolites
• Infer:
  - What set of products can be synthesized by the small-molecule metabolism of the organism
• Can known growth medium yield known essential compounds?
Pacific Symposium on Biocomputing p471 2001

Algorithm: Forward Propagation
[Figure: the nutrient set seeds the metabolite set; reactions from the PGDB reaction pool whose reactants are all in the metabolite set are "fired", and their products are added to the metabolite set]

Results
• Phase I: Forward propagation
  - 21 initial compounds yielded only half of 38 essential compounds for E. coli
• Phase II: Manually identify
  - Bugs in EcoCyc (e.g., two objects for tryptophan)
  - Missing initial protein substrates (e.g., ACP)
  - Missing pathways in EcoCyc
• Phase III: Forward propagation with 11 more initial metabolites
  - Yielded all 38 essential compounds

Nutrient-Related Analysis: Validation of the EcoCyc Database
Results on EcoCyc: Phase I:
• Essential compounds
  - produced: 19
  - not produced: 19
• Total compounds
  - produced: (28%)
• Reactions
  - fired: (31%)

Missing Essential Compounds Due To
• Bugs in EcoCyc
• Narrow conceptualization of the problem
  - Protein substrates
• Incomplete biochemical knowledge

Nutrient-Related Analysis: Validation of the EcoCyc Database
Results on EcoCyc: Phase II (After adding 11 extra metabolites):
• Essential compounds
  - produced: 38
  - not produced: 0
• Total compounds
  - produced: (49%)
  - not produced: (51%)
• Reactions
  - fired: (58%)
  - not fired: (42%)

Pathway Tools Misconceptions
• PathoLogic does not re-annotate genomes
• Pathway Tools does not handle quantitative information
• Pathway/Genome Editors do not work through the web
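
The forward-propagation check from the "Global Consistency Checking" and "Algorithm: Forward Propagation" slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Pathway Tools implementation (which the slides describe as Common Lisp): the Reaction type, the forward_propagate function, and the toy compounds A to F are hypothetical names invented here, and no real EcoCyc data is used. The sketch seeds the metabolite set with the nutrients, repeatedly fires every reaction whose reactants are all available, and stops when no new reaction can fire.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set, Tuple


class Reaction(NamedTuple):
    """Hypothetical stand-in for a PGDB reaction (not the Pathway Tools schema)."""
    name: str
    reactants: FrozenSet[str]
    products: FrozenSet[str]


def forward_propagate(
    nutrients: Iterable[str], reactions: Iterable[Reaction]
) -> Tuple[Set[str], Set[str]]:
    """Return (reachable metabolites, names of fired reactions)."""
    pool = list(reactions)
    metabolites: Set[str] = set(nutrients)  # start from the nutrient set
    fired: Set[str] = set()
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for rxn in pool:
            # A reaction fires once, when all of its reactants are available.
            if rxn.name not in fired and rxn.reactants <= metabolites:
                fired.add(rxn.name)
                metabolites |= rxn.products
                changed = True
    return metabolites, fired


# Toy network (made-up compounds, for illustration only).
toy_reactions = [
    Reaction("r2", frozenset({"C"}), frozenset({"D"})),
    Reaction("r1", frozenset({"A", "B"}), frozenset({"C"})),
    Reaction("r3", frozenset({"E"}), frozenset({"F"})),  # E is never available
]
metabolites, fired = forward_propagate({"A", "B"}, toy_reactions)

# Compare against a hypothetical list of essential compounds.
essential = {"D", "F"}
missing = essential - metabolites
print("produced:", sorted(metabolites))
print("fired:", sorted(fired))
print("essential but not produced:", sorted(missing))
```

In this toy run the essential compound F is reported as missing because its only producing reaction never fires, which mirrors the kind of gap (a missing nutrient, a missing pathway, or a database error) that the Phase II manual inspection in the slides is meant to track down.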

HumanCyc: Human Metabolic Pathway Database Consortium
• Construct DB of human metabolic pathways using PathoLogic
• Link to human genome web sites
• Hire one curator to refine and curate with respect to literature over a 2 year period
  - Remove false-positive predictions
  - Insert known pathways missed by PathoLogic
  - Add comments and citations from pathways and enzymes to the literature
  - Add enzyme activators, inhibitors, cofactors, tissue information
• Available as flatfiles and with Pathway/Genome Navigator
• New versions to be released every 6 months

Summary
• Pathway/Genome Databases
  - MetaCyc non-redundant DB of literature-derived pathways
  - 14 organism-specific PGDBs available through SRI at BioCyc.org
  - Computational theories of biochemical machinery
• Pathway Tools software
  - Extract pathways from genomes
  - Morph annotated genome into structured ontology
  - Distributed curation tools for MODs
  - Query, visualization, WWW publishing

BioCyc and Pathway Tools Availability
• WWW BioCyc freely available to all
  - BioCyc.org
  - Six BioCyc DBs openly available to all
• BioCyc DBs freely available to non-profits
  - Flatfiles downloadable from BioCyc.org
  - Binary executable:
    - Sun UltraSparc-170 w/ 64MB memory
    - PC, 400MHz CPU, 64MB memory, Windows-98 or newer
  - PerlCyc API
• Pathway Tools freely available to non-profits

Acknowledgements
• SRI
  - Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud
• EcoCyc Project
  - Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier
• MetaCyc Project
  - Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville
• Stanford
  - Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh
• Funding sources:
  - NIH National Center for Research Resources
  - NIH National Institute of General Medical Sciences
  - NIH National Human Genome Research Institute
  - Department of Energy Microbial Cell Project
  - DARPA BioSpice, UPC
BioCyc.org