Pathway/Genome Databases and Software Tools

Peter D. Karp, Ph.D.
Bioinformatics Research Group
SRI International
pkarp@ai.sri.com
http://ecocyc.DoubleTwist.com/ecocyc/

SRI International Bioinformatics

Overview
- Overview of bioinformatics
- Motivations for the EcoCyc project
- EcoCyc demo
- Description of EcoCyc database and Pathway Tools software
- Underlying technologies
  - Ocelot object database
  - GKB Editor
  - X-windows to WWW translator

Definition of Bioinformatics
- Computational techniques for management and analysis of biological data and knowledge
- Methods for disseminating, archiving, interpreting, and mining scientific information

Motivations for Bioinformatics
- Growth in molecular-biology knowledge
- Industrialization of biological experimentation
- High-throughput biology
  - Genome sequences
  - Gene and protein expression data
  - Protein-protein interaction data
  - Protein 3-D structures
  - …

Motivations for EcoCyc - E. coli Encyclopedia
- Integrate E. coli information dispersed in the literature
  - New paradigm of scientific publishing
- Model the full metabolic network of an organism
- Integrate genomic data with functional data
- Develop algorithms for computing with function
- Provide a challenging domain for computer-science research

Definitions
- A chemical reaction interconverts chemical compounds: A + B = C + D
- An enzyme is a protein that accelerates chemical reactions
- A pathway is a linked set of reactions
  - A conceptual unit of the cell's biochemical machine

Organism-Specific Pathway/Genome Databases
- Layer functional information above the genome
- Rich ontology to encode biological information with high fidelity
  - Chromosomes, genes, operons, gene products, reactions, pathways
- Curated by experts for that organism
- Integrate literature and computational predictions

Pathway Tools Software
- Pathway/Genome Navigator
  - WWW publishing of PGDBs
  - Graphic depictions of pathways, chromosomes, operons
  - Pathway visualization of gene-expression data
- Pathway/Genome Editors
  - Distributed curation of genome annotations
  - Distributed object database system
  - Interactive editing tools
- PathoLogic
  - Prediction of metabolic network from genome

EcoCyc = E. coli Dataset + Pathway/Genome Navigator
- Metabolic network
  - Pathways: 158
  - Reactions: 1,117
  - Compounds: 1,887
  - Gene Products: 4,393
  - Genes: 4,393
  - Operons: 375
- http://ecocyc.DoubleTwist.com/ecocyc/

EcoCyc
- Collaborative development via internet
  - Karp -- Bioinformatics architect
  - Riley -- Metabolic pathways, signal transduction
  - Saier and Paulsen -- Transport
  - Collado -- Regulation of gene expression
- Ontology of 1000 biological classes
- 14,000 instances
- Over 2,600 registered users

Pathway Tools Software
[Diagram components]
- Pathway/Genome Navigator
- PathoLogic Pathway Predictor
- Pathway/Genome Databases
- Pathway/Genome Editors
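
The Definitions slide (reaction, enzyme, pathway) can be illustrated with a small data model. This is only a sketch in Python, not the EcoCyc schema: the class names, field names, and the reading of "linked" as "reactions connected through shared compounds" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reaction:
    """A chemical reaction interconverts compounds, e.g. A + B = C + D."""
    name: str
    substrates: frozenset
    products: frozenset
    enzymes: frozenset = frozenset()  # hypothetical: proteins accelerating it

    def compounds(self):
        return self.substrates | self.products


@dataclass
class Pathway:
    """A pathway is a linked set of reactions."""
    name: str
    reactions: list = field(default_factory=list)

    def is_linked(self):
        # Assumption: two reactions are "linked" when they share a compound,
        # and a pathway is linked when its reactions form one connected group.
        if not self.reactions:
            return False
        seen, stack = {0}, [0]
        while stack:
            i = stack.pop()
            here = self.reactions[i].compounds()
            for j, other in enumerate(self.reactions):
                if j not in seen and here & other.compounds():
                    seen.add(j)
                    stack.append(j)
        return len(seen) == len(self.reactions)


# Illustrative compounds only (single letters, as on the slide).
r1 = Reaction("r1", frozenset("AB"), frozenset("CD"))
r2 = Reaction("r2", frozenset("C"), frozenset("E"))
r3 = Reaction("r3", frozenset("X"), frozenset("Y"))
```

Here `Pathway("p", [r1, r2])` is linked through compound C, while adding `r3` breaks the link because it shares no compound with the others.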
Creation of the Overview Graph
- Run layout algorithms on individual pathway graphs
  - Automatically determine topology of pathway graph
  - Apply associated layout algorithm (linear, circular, tidy tree)
- Use superpathways to create hierarchical layouts
  - Treat each individual pathway as a single node
  - Pathway connections are edges
  - Run appropriate layout algorithm
- Manually position the resulting pathway clusters

Inference of Metabolic Pathways
[Diagram components]
- Inputs
  - Annotated genome (structured ASCII text file): list of gene products, list of genes/ORFs, DNA sequence
  - MetaCyc
- PathoLogic
- Outputs
  - Pathway/Genome Database: metabolic network, pathways, reactions, compounds, gene products, genes, genomic map
  - Reports

Summary of H. pylori Analysis
- For 121 E. coli pathways, what is the evidence that each pathway occurs in H. pylori?
  - Strong evidence: 41
  - Medium evidence: 29
  - Little or no evidence: 51
- 31 reactions catalyzed by H. pylori but not by E. coli
- H. pylori has partial abilities to synthesize cofactors and amino acids, extremely limited carbohydrate catabolism, some amino acid utilization, and a reductive citric-acid pathway

Microbial Pathway/Genome DBs
- Literature-based datasets:
  - MetaCyc
  - Escherichia coli
- PathoLogic-based datasets:
  - Bacillus subtilis
  - Mycobacterium tuberculosis
  - Helicobacter pylori
  - Haemophilus influenzae
  - Mycoplasma pneumoniae
  - Treponema pallidum
  - Chlamydia trachomatis
  - Saccharomyces cerevisiae

Pathway Tools Software Architecture
- Implemented in Common Lisp
- WWW server runs as a single Unix process with a separate thread to service each query
- Components
  - Grasper-CL: graph manager
  - Ocelot: object database
  - GKB Editor: schema-driven editor
  - EcoCyc WWW Server

Pathway Tools Architecture - Development Configuration
[Diagram components]
- WWW Server
- Pathway Genome Navigator
- GFP API
- Ocelot DBMS
- X-Windows Graphics
- Object Editor
- Pathway Editor
- Reaction Editor
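
The step "automatically determine topology of pathway graph" from the Creation of the Overview Graph slide can be sketched as follows. This is not the Pathway Tools implementation (which is in Common Lisp and not shown in the slides). It is a minimal Python illustration that assumes the graph is given as a list of directed edges, ignores edge direction, and assumes no self-loops. The function name and the "other" fallback category are hypothetical.

```python
from collections import defaultdict


def classify_topology(edges):
    """Classify a pathway graph as 'linear', 'circular', 'tree', or 'other'.

    Edge direction is ignored; duplicate edges collapse into one.
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    nodes = list(adj)
    if not nodes:
        return "other"

    # Check that the graph is a single connected component.
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        n = stack.pop()
        for m in adj[n]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    if len(seen) != len(nodes):
        return "other"

    n_edges = sum(len(neighbors) for neighbors in adj.values()) // 2
    degrees = [len(adj[n]) for n in nodes]
    if n_edges == len(nodes) - 1:          # connected and acyclic
        return "linear" if max(degrees) <= 2 else "tree"
    if n_edges == len(nodes) and all(d == 2 for d in degrees):
        return "circular"                  # a single simple cycle
    return "other"
```

A layout routine could then be chosen by the returned label, for example a tidy-tree layout for "tree"; the mapping from label to layout algorithm is left out because the slides do not describe it.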
Oracle (final component of the Development Configuration diagram)

Ocelot Database System
- Object Database Manager
- Persistence via filesystem or relational DBMS
- Demand and background faulting of objects from RDBMS
- Two-level object caching
- Extensive bioinformatics schema
- Stored transaction history
  - Inspect object history

Ocelot Knowledge Server Architecture
- Frame data model
- Persistent storage via
  - Disk files
  - Oracle DBMS
- Optimistic concurrency-control protocol
- Schema evolution
- Logging facility

The Frame Data Model
- Frames are of two types: classes, instances
- Frames have slots that define their properties, attributes, relationships
  - A slot has one or more values
  - Each value can be any Lisp datatype
- Slotunits define metadata about slots:
  - Domain, range, inverse
  - Collection type, number of values, value constraints

Inference Capabilities
- Inheritance of defaults
- Slot values computed via attached procedures
- Maintenance of inverse relationships
- Constraint system
  - Deferred evaluation
  - Tolerant of nonconformant data

Storage System Architecture
- Oracle DBMS is submerged within FRS
- Relational schema is domain independent, supports multiple KBs simultaneously
- Frames transferred from DBMS to Ocelot
  - On demand
  - By background prefetcher
- Memory cache
- Persistent disk cache to speed performance via Internet

Frame Faulting
  (get-slot-value gene 'map-position)
- Gene present in in-memory object cache?
- Gene present in cache on local disk?
- Query Oracle DBMS

Logging
- Oracle DBMS stores:
  - The latest version of each frame
  - A history of all OKBC operations applied to KB
- Reconstruct earlier versions of KB
- View history of changes to an object
- Update replicates
- Concurrency control

Schema Management
- FRSs store and process class and instance information similarly
- Applications can query schema information as easily as they can query instances

GKB Editor
- Browser and editor for KBs and ontologies
- Four editing tools
- GKB Editor reusable with multiple FRSs
  - All database queries via OKBC/GFP API
  - Interoperability achieved with Ocelot, LOOM, Ontolingua
- All operations are schema driven
- http://www.ai.sri.com/~gkb/overview.html

Editors
- Taxonomy editor
- Frame editor
- Relationships editor
- Spreadsheet editor

Results
- Ocelot in use in the EcoCyc project for 5 years
- Supports collaborative development of EcoCyc by four groups in North America
  - Distributed architecture
- GKB Editor in active use
- Supports development of 8 Pathway/Genome Databases

Summary
- Pathway/Genome Databases
- Pathway Tools software
  - Extract pathways from genomes
  - Distributed curation tools
  - Query, visualization, WWW publishing
  - Analysis algorithms

Computer Science Results
- Extend scalability and multiuser access for knowledge representation systems
- Reusable, schema-driven KB editor
- Hierarchical graph layout algorithms
- Dynamic translation from X-windows to HTML+GIF
- Importance of ontologies and of content: Discovery = Algorithm + Database

Problem Solving Depends on Algorithms and Content
[Figure labels: Compute Time, Algorithm Quality, Solution Quality, Database Size and Quality]

Bioinformatics Results: Content
- The EcoCyc database describes the full metabolic map of an organism
- The MetaCyc database describes over 300 metabolic pathways
- Ontology spans genome to pathway information

Bioinformatics Results: Algorithms
- Software environment for genome and pathway information
  - Query and visualization
  - Distributed database development
- PathoLogic algorithm predicts the metabolic network of an organism from its genome
- Algorithms under development for qualitative modeling of the cell

Acknowledgements
- Funding sources: NIH National Center for Research Resources
- Collaborators:
  - Monica Riley, Marine Biological Laboratory
  - Milton Saier, UC San Diego
  - Julio Collado, UNAM
  - Christos Ouzounis, European Bioinformatics Institute

Peter D. Karp, Ph.D.
http://www.ai.sri.com/pkarp/
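
The lookup order on the Frame Faulting slide (in-memory cache, then local disk cache, then the Oracle DBMS) can be sketched as a tiered read path. This is an illustration in Python rather than Ocelot's Common Lisp code. `FrameStore`, its attributes, and the use of plain dictionaries to stand in for the disk cache and the DBMS are all assumptions, and the frame identifier and slot value in the usage note are made up.

```python
class FrameStore:
    """Tiered frame lookup: memory cache -> disk cache -> DBMS (all dicts here)."""

    def __init__(self, dbms, disk_cache=None):
        self.memory = {}                      # in-memory object cache
        self.disk = {} if disk_cache is None else disk_cache  # stand-in for local disk cache
        self.dbms = dbms                      # stand-in for the Oracle DBMS
        self.faults = []                      # (frame_id, tier) for each lookup

    def _fault(self, frame_id):
        if frame_id in self.memory:
            self.faults.append((frame_id, "memory"))
            return self.memory[frame_id]
        if frame_id in self.disk:
            frame = self.disk[frame_id]
            self.faults.append((frame_id, "disk"))
        else:
            frame = self.dbms[frame_id]       # raises KeyError for unknown frames
            self.disk[frame_id] = frame       # populate the disk cache
            self.faults.append((frame_id, "dbms"))
        self.memory[frame_id] = frame         # populate the memory cache
        return frame

    def get_slot_value(self, frame_id, slot):
        """Rough analogue of (get-slot-value gene 'map-position)."""
        return self._fault(frame_id)[slot]
```

For example, with `store = FrameStore({"gene-1": {"map-position": 42.0}})`, the first call to `store.get_slot_value("gene-1", "map-position")` is served from the DBMS tier and later calls from memory. A new `FrameStore` that reuses the same `disk_cache` dictionary would find the frame in the disk tier, which is the role the slides give the persistent disk cache.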